REMARKS 



Claims 1, 3-14 and 16-24 are currently active. 
Claims 2 and 15 are canceled. 

The Examiner has objected to the drawings because the Examiner is of the view 
that the drawings must show every feature of the invention specified in the claims. 
Specifically, the drawings must show every structural feature of the invention. The claims 
themselves are not required to be shown in the figures. It is respectfully submitted that the 
drawings do show every structural feature claimed. 

The Examiner has objected to the abstract. The abstract has been amended to 
remove any fragmented sentences. 

The Examiner has rejected Claims 10 and 22 under 35 U.S.C. 112, second 
paragraph. Applicants respectfully traverse this rejection. Applicants submit the claims are 
clear and definite to one skilled in the art. In regard to the language at issue in Claim 10, one 
skilled in the art would understand the phrase "any segments associated with the segment not 
accepted that are received after the segment that was not accepted was received, are ignored" 
to refer, for instance, to any segments of the same packet. Since one segment of 
the packet is not accepted, any subsequent segments that are received and are associated with 
the segment that was not accepted are also ignored, because the packet would be in error once 
one of its segments has been removed. This is also applicable to Claim 22. 

The Examiner has rejected Claims 1 and 14 as being anticipated by Yamada. 
Applicants amended Claims 1 and 14 to include the limitations of Claims 2 and 15, 
respectively. Accordingly, this rejection is obviated. 

The Examiner has rejected Claims 2 and 3 as being unpatentable over Yamada 
in view of Petersen. Applicants respectfully traverse this rejection. 

Referring to Yamada, there is disclosed a shared buffer memory switch for an 
ATM switching system and its broadcasting control method. Yamada teaches a shared buffer 
memory switch having a shared buffer memory 3 for storing cells from input ports 11 to 
output ports 14. There is a cell multiplexer 1 multiplexing incoming cells through input ports 
and outputting the multiplexed cells to a time division multiplex data bus 12, and a cell 
demultiplexer 7 for demultiplexing and distributing the multiplexed cells on the time division 
multiplex data bus 12 to each of the output ports. There is a shared buffer memory control 
10 for controlling the operation of writing cells on the time division multiplex data bus 12 into 
the shared buffer memory 3 in the writing cycle of the operation, and reading cells in the 
shared buffer memory 3 out to the time division multiplex data bus 12 in a reading cycle of the 
operation. There is a FIFO memory 4. There are a plurality of FIFO memories 9. There is a 
broadcast registration table 6. There is also a bit map check 8. See column 7, lines 18-52. 

Yamada teaches that in the writing operation, the cell multiplexer 1 multiplexes 
cells coming through the input ports 11 and outputs those multiplexed cells to the time division 
multiplex data bus 12, and at the same time, the routing information which shows the 
destination of each cell is transferred to the shared buffer memory control 10 through the 
routing information path 13. The type of cell is also identified by the cell multiplexer 1, and 
this identifying information is added to the cell when it is multiplexed. The shared buffer 
memory control 10 writes each cell in a cell slot on the time division multiplex data bus 12 
into the shared buffer memory one by one in accordance with their arrival, and all cells in a 
cell slot group for the input port are written in one cycle of the writing operation. 

When the cell is an ordinary cell, the shared buffer memory control 10 picks up 
one address through the information path 16 from the address pointer queue of 
FIFO 4, which manages the addresses of idle areas available in the shared buffer memory 3, 
and the cell is stored in the idle area indicated by the address picked up from the FIFO 4. The 
address of the shared buffer memory 3 at which the cell is stored is written, through the 
information path 17, into the address pointer queue of the FIFO 9 corresponding to the output 
port to which the cell is to be routed. When the cell is a broadcasting cell, the shared 
buffer memory control 10 refers to the broadcast registration table 6 and extracts the bit map 
data corresponding to the routing information received through the routing information 
path 13. See column 7, line 62-column 8, line 42. 

Yamada does not teach or suggest the limitation that the transferring mechanism 
transfers predetermined portions of the packet as fixed length segments as the fixed length 
segments are received. At best, Yamada teaches to receive a cell and remove the header 
information, and only then transfer the remaining segments. As such, the segment 
that is transferred is not the same as the segment that was received, since the header is 
removed in between. 

Petersen teaches that a sublayer 301 is called the segmentation and reassembly 
sublayer. The segmentation and reassembly sublayer is invoked if a user data packet is so 
long that segmentation is necessary to avoid sending user data to a receiving entity in a 
minicell whose length, excluding the header, exceeds a predefined maximum length. See 
column 3, lines 23-32. 

Petersen teaches a sending entity 401, an interconnecting link 402, and a 
receiving entity 403. The sending entity contains the segmentation part of the 
segmentation and reassembly sublayer and the receiving entity contains the reassembly part of 
the segmentation and reassembly sublayer. The interconnecting link carries the ATM cells 
from the sending entity to the receiving entity, and the ATM cells, in turn, carry the segments 
of the user data in minicells. See column 3, lines 33-44. Unlike the known ATM protocol 
model, there is no longer a 1-to-1 correspondence between each user data packet and each 
minicell. Moreover, a single minicell can overlap no more than one ATM cell border as compared 
to the known protocol model. This is because the length of each minicell is limited to a 
length that is less than the ATM cell payload. See column 3, lines 45-56. 

The Examiner suggests that the deficiency in the teachings of Yamada is met 
by the teachings of Petersen. Specifically, the Examiner refers to column 3, line 66 to 
column 4, line 17 and column 2, lines 20-28 of Petersen as providing the missing teachings in 
regard to Yamada to arrive at applicants' invention of Claim 2 (now Claim 1). 

Referring to column 3, lines 66 to column 4, line 17 of Petersen, it specifically 
teaches that in both embodiments of this new protocol model taught by Petersen, there is 
employed the same basic segmentation strategy. The user packet is divided into several 
segments. All but the last segment have a fixed and equal length. The length of the last 
segment is adjusted so that all of the segments together are the same length as the original user 
packet. The segments are then placed into minicell payloads. Consequently, the length of 
each minicell payload is the same as the length of each corresponding user packet segment. 

However, as the Examiner is fully aware in regard to patent law, a reference 
must be taken as a whole, and the teachings the Examiner relies upon cannot be taken out of the 
context in which they are found. As explained above, in regard to column 3, lines 28-32, 
Petersen specifically teaches that the segmentation and reassembly sublayer is invoked if a user 
data packet is so long that segmentation is necessary to avoid sending user data to a receiving 
entity in a minicell whose length, excluding the header, exceeds a predefined maximum length. 
This is important because, once again, there is clearly the step of removing the header of each 
cell before forming the segment and sending it on. Thus, the citation relied upon by the 
Examiner does not teach or suggest the limitation of transferring predetermined portions of the 
packet as fixed length segments as the fixed length segments are received. This citation is 
completely silent in regard to this limitation, and read in the context of the entire reference, 
there are many steps which occur once a cell or a packet is received before a 
segment is formed and then sent out. 

In regard to the citation at column 2, lines 20 to 28, it simply teaches that the 
transfer protocol effectively utilizes available bandwidth and reduces the speech quality 
problems associated with transferring telecommunication data over excessively large minicells. 
See column 2, lines 20-28. Again, there is no specific teaching or suggestion whatsoever in 
regard to the limitations of applicants' claimed invention. 

As explained above, Yamada also does not arrive at this limitation because 
Yamada specifically also teaches the step of removing the header before the segment is 
formed. That means the segment that is transferred is different from the cell that is received, 
and this limitation is also not met. For this reason alone, the combination of Yamada and 
Petersen fails to arrive at the limitations of Claim 1, as now amended to include the limitations 
of Claim 2. 

Furthermore, there must be some teaching or suggestion in the references 
themselves to combine the teachings the Examiner is relying upon to arrive at applicants' 
claimed invention. Here, there is no teaching or suggestion whatsoever to combine these two 
references. In fact, the only motivation is from the claims of applicants themselves. 
However, this is the use of hindsight and is not permitted under patent law. The Examiner 
cannot use the elements of a claim of applicants as a road map to find the different elements 
and limitations in the prior art, and having found the different elements and limitations in the 
prior art, conclude that applicants' claimed invention is arrived at. No one skilled in the art 
would attempt to combine Petersen, which has to do with a novel segmentation and reassembly 
layer that provides a new AALm protocol model (see column 3, line 23), with the teachings of 
Yamada, which is directed to a shared buffer memory switch. 

In addition, these two references cannot be combined, or, if they could be 
combined at all, would require significant experimentation and research to modify them 
so that the operation of the novel segmentation and reassembly layer model 
taught by Petersen could somehow be used with the shared buffer memory 
switch taught by Yamada. This very requirement of significant experimentation and 
research specifically supports the non-obviousness of amended Claim 1 of applicants. 
Accordingly, Claim 1 is not obvious from Yamada and Petersen, and is patentable over the 
applied art of record. Claim 3 is dependent on parent Claim 1 and is patentable for the reasons 
Claim 1 is patentable. 

The Examiner has rejected Claims 4-8 as being unpatentable over Yamada and 
Petersen and Cisneros. Applicants respectfully traverse this rejection. Cisneros does not add 
anything in relevant part to the teachings of Yamada and Petersen to arrive at Claim 1 of 
applicants. Claims 4-8 are dependent on parent Claim 1 and are patentable for the reasons 
Claim 1 is patentable. 

The Examiner has rejected Claims 9-13 as being unpatentable over Yamada, 
Petersen, Cisneros and Calamvokis. Applicants respectfully traverse this rejection. 
Calamvokis does not add anything in relevant part to the teachings of Yamada and Petersen to 
arrive at Claim 1 of applicants. Claims 9-13 are dependent on parent Claim 1 and are 
patentable for the reasons Claim 1 is patentable. 

The Examiner has rejected Claims 15 and 16 as being unpatentable over 
Yamada in view of Petersen. Claims 15 and 16, now Claims 14 and 16, are patentable for the 
reasons newly amended Claim 1 is patentable over Yamada and Petersen. 

The Examiner has rejected Claims 17-20 as being unpatentable over Yamada 
and Petersen and Cisneros. Claims 17-20 are patentable for the same reasons that Claim 4 is 
patentable over the applied art of record. 

The Examiner has rejected Claims 21-24 as being unpatentable over Yamada, 
Petersen, Cisneros and Calamvokis. Claims 21-24 are patentable over the applied art of record 
for the same reasons that Claims 9-13 are patentable over the applied art of record. 

A substitute clean specification and marked up original specification are 
enclosed. The marked original specification has deletions bracketed and additions underlined. 
No new matter has been added. The information deleted is unnecessary for enablement and is 
considered superfluous information that applicant desires not to have published. 

In view of the foregoing amendments and remarks, it is respectfully requested 
that the outstanding rejections and objections to this application be reconsidered and 
withdrawn, and that Claims 1, 3-14 and 16-24, now in this application, be allowed. 



Respectfully submitted, 



CERTIFICATE OF MAILING 



FAN ZHOU, ET AL. 




Attorney for Applicants 

LONG PACKET HANDLING 



FIELD OF THE INVENTION 

The present invention is related to transferring a packet 
to a memory. More specifically, the present invention is related 
to transferring a packet to a memory controller of a fabric from an 
aggregator in fixed length segments followed by a single final 
segment of any length. 

FEB 2 3 2004 

BACKGROUND OF THE INVENTION 

Ordinarily, an entire packet is transferred at once, 
occupying an interface for as long as it takes to transfer the 
packet. Lengthy packets can monopolize an interface for relatively 
long periods of time. This can delay other packets which share the 
interface, affecting their QoS. It also increases the amount of 
buffer required at the input to shared interfaces to smooth out 
bursts caused by lengthy packets. 



Instead of transferring an entire lengthy packet at once, 
it is transferred in fixed length segments followed by a single 
final segment of any length, termed Long Packet Handling. This 
puts a small bound on the maximum period any one packet can occupy 
an interface, reducing the effect it has on the QoS of packets 
belonging to other connections. This also reduces store-and-forward 
requirements because the Aggregator can begin forwarding a packet 
as soon as it receives a segment instead of waiting until it 
receives the entire packet. This simple form of segmentation and 
reassembly requires only as many contexts as there are sources. 

SUMMARY OF THE INVENTION 

The present invention pertains to a switch for switching 
packets from a plurality of sources. The switch comprises a memory 
in which portions of packets are stored. The switch comprises a 
transferring mechanism which transfers predetermined portions of a 
packet to the memory as the predetermined portions are received. 

The present invention pertains to a method for switching 
packets. The method comprises the steps of receiving portions of 
a packet at a transferring mechanism of a switch. Then there is 
the step of transferring predetermined portions of the packet to a 
memory of the switch as the predetermined portions are received at 
the transferring mechanism. 

BRIEF DESCRIPTION OF THE DRAWINGS 

In the accompanying drawings, the preferred embodiment of 
the invention and preferred methods of practicing the invention are 
illustrated in which: 

Figure 1 is a schematic representation of packet striping 
in the switch of the present invention. 

Figure 2 is a schematic representation of an OC 48 port 
card. 



Figure 3 is a schematic representation of a concatenated 
network blade. 

Figure 4 is a schematic representation regarding the 
connectivity of the fabric ASICs. 

Figure 5 is a schematic representation of a 32-bit cell 
transfer. 

Figure 6 is a schematic representation regarding 
back pressure. 

Figure 7 is a schematic representation of a 32 bit packet 
transferred using external connection number bus. 

Figure 8 is a schematic representation of a 64-bit cell 
transferred. 

Figure 9 is a schematic representation of a 64-bit packet 
transfer. 

Figure 10 is a schematic representation of ATM cell flow 
in the switch. 

Figure [[11]] 5 is a schematic representation of sync 
pulse distribution. 

Figure 12 is a schematic representation regarding the 
write cycle. 

Figure 13 is a schematic representation of the read 
cycle. 

Figure 14 is a schematic representation of the striper 
ASIC architecture. 

Figure 15 is a schematic representation of the aggregator 
ASIC architecture. 

Figure 16 is a schematic representation of a memory 
controller ASIC architecture. 

Figure 17 is a schematic representation of the wide cache 
line shared memory architecture. 

Figure 18 is a schematic representation of a separator 
ASIC architecture. 

Figure 19 is a schematic representation of an unstriper 
ASIC architecture. 

Figure [[20]] 6 is a schematic representation regarding 
the relationship between transmit and receive sequence counters for 
the separator and unstriper, respectively. 

Figure 21 is a schematic representation of a receive 
synchronizer. 

Figure [[22]] 2 is a schematic representation of a switch 
of the present invention. 

Figure [[23]] 8 is a schematic representation of how the 
prior art transfers packets. 

Figure [[24]] 9 is a schematic representation of how the 
present invention transfers packets. 

DETAILED DESCRIPTION 

Referring now to the drawings wherein like reference 
numerals refer to similar or identical parts throughout the several 
views, and more specifically to figure [[22]] 2 thereof, there is 
shown a switch 10 for switching packets from a plurality of sources 
12. The switch 10 comprises a memory 14 in which portions of 
packets are stored. The switch 10 comprises a transferring 
mechanism 16 which transfers predetermined portions of a packet to 
the memory 14 as the predetermined portions are received. 

Preferably, the transferring mechanism 16 transfers 
predetermined portions of the packet as fixed length segments as 
the fixed length segments are received followed by a single final 
segment of any length wherein the packet is transferred to the 
memory 14. The transferring mechanism 16 preferably transfers 
fixed length segments of different packets interleaved among each 
other as they are received to the memory 14. In the memory 14, the 
segments are stored with other segments of the same packet. 
Preferably, the transferring mechanism 16 includes an aggregator 18 
which receives portions of packets from the plurality of sources 
12. 

The memory 14 preferably includes a memory controller 20. 
Preferably, the aggregator 18 uses a TDM to multiplex segments of 
packets from different sources 12 to the memory controller 20. The 
aggregator 18 preferably places an identifier with each segment 
identifying which source the segment came from. Only long or 
lengthy packets need the identifier. Preferably, the memory 
controller 20 includes per source queues 22, and stores each 
segment in a corresponding per source queue 22 based on the 
identifier of the segment. 

The memory controller 20 preferably includes per 
destination queues 24, and once all segments for a packet are 
received at a per source queue 22, all the segments of the packet 
are changed to a corresponding per destination queue 24. That is, 
preferably, the physical location of where the segments are stored 
in the memory 14 does not change, but the designation by the memory 
controller of the respective per source queue 22 is changed to a 
per destination queue 24. Preferably, the memory controller 20 has 
acceptance criteria for accepting segments, and if the segment is 
not accepted, then all previously received segments associated with 
the segment not accepted are purged from the per source queue 22 
and any segments associated with the segment not accepted that are 
received after the segment that was not accepted was received, are 
ignored. 
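
The queue handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the memory controller's actual design: the class name, the `accept` predicate, and the `is_final` flag marking a packet's last segment are all hypothetical, and the "change of designation" from per source queue to per destination queue is modeled simply by moving a list reference rather than copying segment data.

```python
from collections import defaultdict, deque


class MemoryControllerSketch:
    """Toy model of per source queues 22 and per destination queues 24."""

    def __init__(self, accept):
        self.accept = accept                        # hypothetical acceptance criteria
        self.per_source = defaultdict(list)         # one long-packet context per source
        self.per_destination = defaultdict(deque)   # completed packets per destination
        self.dropping = set()                       # sources whose current packet was rejected

    def receive(self, source, destination, segment, is_final):
        if source in self.dropping:
            # Segments arriving after a rejected segment of the same packet are ignored.
            if is_final:
                self.dropping.discard(source)
            return
        if not self.accept(segment):
            # Purge every previously received segment of this packet.
            self.per_source.pop(source, None)
            if not is_final:
                self.dropping.add(source)
            return
        self.per_source[source].append(segment)
        if is_final:
            # Whole packet accepted: re-designate it to the destination queue.
            packet = self.per_source.pop(source)
            self.per_destination[destination].append(packet)
```

A packet is therefore either enqueued as a whole or dropped as a whole, and only one context per source is ever held.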

The switch 10 preferably includes a fabric 26 in which 
the aggregator 18 and the memory controller 20 are disposed, and 
includes a separator 28 disposed in the fabric 26 connected to the 
aggregator 18. Preferably, the switch 10 includes a port card 30 
having a striper 32 which sends portions of packets to the 
aggregator 18, and an unstriper 34 which receives portions of 
packets from the separator 28. The memory controller 20 includes 
a shared memory 36, and the destination queues 24 and the source 
queues 22 are part of the shared memory 36. 

The present invention pertains to a method for switching 
packets. The method comprises the steps of receiving portions of 
a packet at a transferring mechanism 16 of a switch 10. Then there 
is the step of transferring predetermined portions of the packet to 
a memory 14 of the switch 10 as the predetermined portions are 
received at the transferring mechanism 16. 

Preferably, the transferring step includes the step of 
transferring the predetermined portions as fixed length segments as 
the fixed length segments are received at the transferring 
mechanism 16 followed by a single final segment of any length 
wherein the packet is transferred to the memory 14. The 
transferring step preferably includes the step of transferring 
fixed length segments of different packets as they are received 
interleaved among each other to the memory 14. Preferably, the 
receiving step includes the step of receiving portions of packets 
from different sources 12 at an aggregator 18 of the transferring 
mechanism 16 disposed in a fabric 26 of the switch 10. 

The transferring step preferably includes the step of 
multiplexing with the aggregator 18 segments of packets from 
different sources 12 to the memory controller 20. Preferably, 
before the transferring step there is the step of placing by the 
aggregator 18 an identifier with each segment identifying 
which source the segment came from. After the transferring step, 
there is preferably the step of storing each segment in a 
corresponding per source queue 22 of the memory controller 20 based 
on the identifier of the segment. Preferably, after the storing 
step there is the step of changing all segments of the packet in 
the per source queue 22 to a corresponding per destination queue 24 of the 
memory controller 20 once all the segments of the packet are 
received at the per source queue 22. 

The receiving step preferably includes the steps of 
purging all previously received segments associated with an 
unaccepted segment that does not meet acceptance criteria for 
accepting a segment of the memory controller 20, and ignoring all 
segments associated with the unaccepted segment received at the 
memory controller 20 after the unaccepted segment is received at 
the memory controller 20. Preferably, the receiving step includes 
the step of receiving portions of packets from different sources 12 
at the aggregator 18 of the transferring mechanism 16 disposed in 
the fabric 26 of the switch 10 from a striper 32 of a port card 30 
of the switch 10. 

After the moving step, there is preferably the step of 
sending portions of packets from the memory controller 20 with a 
separator 28 of the fabric 26 to an unstriper 34 of the port card 
30. 

In the operation of the invention, the aggregator 18 
begins receiving a packet from the Striper 32. It begins 
transferring the packet to the Memory controller 20 once it finishes 
receiving the packet, or once the received data reaches the Long 
Packet Segment length: 600 bits per Memory 
controller 20, which is equivalent to 7200 bits per fabric 26. If 
the packet is longer than 7200 bits per fabric 26, the aggregator 18 
segments the packet into however many 7200 bit segments are 
required, followed by a final segment which is less than or equal 
to 7200 bits. The aggregator 18 uses TDM to multiplex packets from 
up to 24 sources 12 onto a single bus. This shared bus is one 
place where segmenting long packets helps QoS. The Memory 
controller 20 uses TDM to multiplex data from 8 aggregators onto a 
single bus, another place where processing a long packet in one 
continuous burst would impact QoS. 
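
As a rough illustration of the segmentation rule in the preceding paragraph, the following Python sketch computes the segment lengths for a packet of a given size. The function name and the choice to work in bits are assumptions made for the example; only the 7200 bit per fabric segment length comes from the text.

```python
SEGMENT_BITS = 7200  # Long Packet Segment length per fabric, from the text


def segment_lengths(packet_bits, segment_bits=SEGMENT_BITS):
    """Return the lengths of the segments a packet is split into.

    Every segment except possibly the last is exactly `segment_bits` long;
    the final segment is less than or equal to `segment_bits`.
    """
    if packet_bits <= 0:
        raise ValueError("packet length must be positive")
    full, remainder = divmod(packet_bits, segment_bits)
    lengths = [segment_bits] * full
    if remainder:
        lengths.append(remainder)
    return lengths
```

For example, a 20000 bit packet yields two 7200 bit segments followed by a 5600 bit final segment, while a packet of 7200 bits or less is transferred as a single segment.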

A similar approach would be to have the source segment 
and the destination reassemble, keeping the segments that traverse 
the fabric 26 relatively short. This would improve QoS through the 
fabric 26 in a similar manner, but would require every destination 
to have per-source, per-priority, unicast/multicast reassembly 
contexts. There would be a greater number of contexts, and they 
would exist in a much greater number of locations. 

The aggregator 18 indicates to the memory controllers 20 
which source the packet is coming from. Since every source can 
only produce one packet at a time, the memory controllers 20 only 
need to keep track of one long packet context per source. The 
memory controllers 20 store each segment in a per-source queue. 
Once the entire packet is accepted, it is linked into the queue for 
the destination to which it will go. The long packet is either 
dropped as a whole, or enqueued as a whole. If at any time it does 
not meet acceptance criteria, the current segment is not enqueued. 
Any previous segments are purged, and any future segments are 
ignored. This is an added benefit over source 
segmentation/destination reassembly. The fabric 26 would not have 
knowledge of which segments belonged to which packets and might 
waste resources on packets that would be dropped at the 
destination. 

The second benefit of segmenting long packets is reduced 
buffering requirements. Where n sources 12 of bandwidth m are 
multiplexed onto a single interface of bandwidth n*m, the required 
buffer depth for each source is approximately 2 * p, where p is the 
maximum transfer length per source. By segmenting long packets in 
this case, p is reduced from ~64k bytes to ~1k bytes for the 
aggregator 18, and 1/12 those values for the memory controllers 20. 
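
The buffering arithmetic above can be checked with a short Python sketch. It assumes the approximate rule of 2 * p stated in the text and treats "64k" and "1k" as 64 * 1024 and 1024 bytes, which is an assumption; the text gives only approximate figures, and the function name is hypothetical.

```python
def buffer_depth_bytes(max_transfer_bytes):
    """Approximate per-source buffer depth, using the 2 * p rule from the text."""
    return 2 * max_transfer_bytes


# Aggregator: p drops from about 64k bytes to about 1k bytes.
aggregator_unsegmented = buffer_depth_bytes(64 * 1024)  # whole packets
aggregator_segmented = buffer_depth_bytes(1024)         # long packet handling

# Memory controllers see 1/12 of the aggregator values.
controller_unsegmented = aggregator_unsegmented / 12
controller_segmented = aggregator_segmented / 12

reduction_factor = aggregator_unsegmented // aggregator_segmented
```

Under these assumptions the per-source buffer at the aggregator shrinks by a factor of 64, and the same factor applies at each memory controller.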

Several error handling mechanisms are part of Long Packet 
Handling. The aggregator 18 enforces a maximum packet length to 
prevent a single packet from consuming all resources. It also 
enforces a maximum transfer time in case a source does not complete 
a packet. The source is allowed to pause the interface during a 
packet transfer, but the maximum transfer time causes the packet to 
be aborted in case of abnormal, excessive pause. 
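
The two limits described above can be sketched as a small guard object. This is a hypothetical illustration only: the class name, the limit values passed in, and the use of a caller-supplied timestamp are assumptions, since the text does not state the actual limits or how time is measured.

```python
class TransferGuard:
    """Tracks one in-progress packet and enforces the two limits in the text."""

    def __init__(self, max_packet_bits, max_transfer_time):
        self.max_packet_bits = max_packet_bits      # hypothetical length limit
        self.max_transfer_time = max_transfer_time  # hypothetical time limit
        self.bits_seen = 0
        self.start = None

    def on_segment(self, bits, now):
        """Return True if the packet may continue, False if it must be aborted."""
        if self.start is None:
            self.start = now  # first segment starts the transfer timer
        self.bits_seen += bits
        if self.bits_seen > self.max_packet_bits:
            return False  # packet exceeds the maximum packet length
        if now - self.start > self.max_transfer_time:
            return False  # source paused too long or never finished
        return True
```

A source may pause between segments without penalty as long as the total elapsed time stays within the limit.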

Figures [[23 and 24]] 8 and 9 demonstrate the reduced buffer 
requirements and better delay performance of the TDM structure 
gained by using long packet handling. 

The switch uses RAID techniques to increase overall 
switch bandwidth while minimizing individual fabric bandwidth. In 
the switch architecture, all data is distributed evenly across all 
fabrics so the switch adds bandwidth by adding fabrics and the 
fabric need not increase its bandwidth capacity as the switch 
increases bandwidth capacity. 

Each fabric provides 40G of switching bandwidth and the 
system supports 1, 2, 3, 4, 6, or 12 fabrics, exclusive of the 
redundant/spare fabric. In other words, the switch can be a 40G, 
80G, 120G, 160G, 240G, or 480G switch depending on how many fabrics 
are installed. 

A portcard provides 10G of port bandwidth. For every 4 
portcards, there needs to be 1 fabric. The switch architecture 
does not support arbitrary installations of portcards and fabrics. 
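
The sizing rules in the two preceding paragraphs can be written out as a small Python sketch. The function names are hypothetical, and rounding up to the next supported fabric count is an assumption made for illustration; the text states only the supported counts and the ratio of one fabric per four portcards.

```python
FABRIC_GBPS = 40
PORTCARD_GBPS = 10
SUPPORTED_FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)  # exclusive of the spare fabric


def switch_bandwidth_gbps(fabrics):
    """Total switching bandwidth for a supported number of fabrics."""
    if fabrics not in SUPPORTED_FABRIC_COUNTS:
        raise ValueError(f"unsupported fabric count: {fabrics}")
    return fabrics * FABRIC_GBPS


def fabrics_needed(portcards):
    """Smallest supported fabric count giving one fabric per 4 portcards.

    Rounding up to the next supported count is an assumption of this sketch.
    """
    needed = -(-portcards // 4)  # ceiling division
    for count in SUPPORTED_FABRIC_COUNTS:
        if count >= needed:
            return count
    raise ValueError("more portcards than the largest configuration supports")
```

For instance, 17 portcards would need five fabrics by the 4:1 ratio, which this sketch rounds up to the supported six-fabric, 240G configuration.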



The fabric ASICs support both cells and packets. As a 
whole, the switch takes a "receiver make right" approach where the 
egress path on ATM blades must segment frames to cells and the 
egress path on frame blades must perform reassembly of cells into 
packets. 

There are currently eight switch ASICs that are used in 
the switch: 



1. Striper - The Striper resides on the portcard and 
SCP-IM. It formats the data into a 12 bit data 
stream, appends a checkword, splits the data stream 
across the N, non-spare fabrics in the system, 
generates a parity stripe of width equal to the 
stripes going to the other fabric, and sends the 
N+1 data streams out to the backplane. 

2. Unstriper - The Unstriper is the other portcard 
ASIC in the switch architecture. It receives 
data stripes from all the fabrics in the system. It 
then reconstructs the original data stream using 
the checkword and parity stripe to perform error 
detection and correction. 

3. Aggregator - The Aggregator takes the data streams 
and routewords from the Stripers and multiplexes 
them into a single input stream to the Memory 
Controller. 

4. Memory Controller - The Memory controller 
implements the queueing and dequeueing mechanisms 
of the switch. This includes the proprietary wide 
memory interface to achieve the simultaneous en- 
/de-queueing of multiple cells of data per clock 
cycle. The dequeueing side of the Memory Controller 
runs at 80Gbps compared to 40Gbps in order to make 
the bulk of the queueing and shaping of connections 
occur on the portcards. 

5. Separator - The Separator implements the inverse 
operation of the Aggregator. The data stream from 
the Memory Controller is demultiplexed into 
multiple streams of data and forwarded to the 
appropriate Unstriper ASIC. Included in the 
interface to the Unstriper is a queue and flow 
control handshaking. 

6. Trident - Trident is, strictly speaking, not one of 
the ASICs. It is actually one-half of the Poseidon 
chipset. Trident will be used to implement the ATM 
portcards within the switch. 

7. Vortex - Vortex is the partner to Trident in the 
Poseidon chipset. Vortex is the ingress ASIC and 
Trident the egress device. Together, the two chips 
implement a 2.5Gbps ingress, 5Gbps egress system 
capable of supporting up to OC-48c ports. 

8. Reassembler - The Reassembler ASIC is the frame 
blade equivalent to Trident. It will be capable of 
taking cell streams from the Unstriper and 
converting them into frames. 
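
The Striper and Unstriper entries in the list above can be illustrated with a minimal Python sketch of striping with a parity stripe. It assumes the parity stripe is the bitwise XOR of the data stripes, as the specification's 120G example states (S_n = A_n XOR B_n XOR C_n). The checkword is omitted, and the assignment of particular bits to particular fabrics is a hypothetical choice made for the example.

```python
from functools import reduce

BUS_BITS = 12  # width of the Striper's data stream, from the text


def stripe(word, fabrics):
    """Split a 12 bit word into equal stripes plus one XOR parity stripe."""
    if BUS_BITS % fabrics:
        raise ValueError("fabric count must divide 12")
    width = BUS_BITS // fabrics
    mask = (1 << width) - 1
    # Hypothetical bit assignment: fabric i gets bits [i*width, (i+1)*width).
    stripes = [(word >> (width * i)) & mask for i in range(fabrics)]
    parity = reduce(lambda a, b: a ^ b, stripes)
    return stripes, parity


def recover(stripes, parity, lost_index):
    """Rebuild one lost stripe from the surviving stripes and the parity."""
    survivors = [s for i, s in enumerate(stripes) if i != lost_index]
    return reduce(lambda a, b: a ^ b, survivors, parity)


def unstripe(stripes):
    """Reassemble the original word from its data stripes."""
    width = BUS_BITS // len(stripes)
    return sum(s << (width * i) for i, s in enumerate(stripes))
```

Because XOR is its own inverse, any single missing stripe equals the XOR of the parity stripe with the remaining stripes, which is what allows one failed fabric to be tolerated.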

There are 3 different views one can take of the 
connections between the fabric: physical, logical, and "active." 
Physically, the connections between the portcards and the fabrics 
are all gigabit speed differential pair serial links. This is 
strictly an implementation issue to reduce the number of signals 
going over the backplane. The "active" perspective looks at a 
single switch configuration, or it may be thought of as a snapshot 
of how data is being processed at a given moment. The interface 
between the fabric ASIC on the portcards and the fabrics is 
effectively 12 bits wide. Those 12 bits are evenly distributed 
("striped") across 1, 2, 3, 4, 6, or 12 fabrics based on how the 
fabric ASICs are configured. The "active" perspective refers to the 
number of bits being processed by each fabric in the current 
configuration which is exactly 12 divided by the number of fabrics. 

The logical perspective can be viewed as the union or max
function of all the possible active configurations. Fabric slot #1
can, depending on configuration, be processing 12, 6, 4, 3, 2, or
1 bits of the data from a single Striper and is therefore drawn
with a 12 bit bus. In contrast, fabric slot #3 can only be used to
process 4, 3, 2, or 1 bits from a single Striper and is therefore
drawn with a 4 bit bus.
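
As a rough illustration of the active and logical perspectives, the sketch below computes the per-fabric stripe width for each legal configuration and the widest bus a given fabric slot can ever see. The function names are hypothetical, and the sketch assumes that fabric slots are filled in ascending order, so that slot k is in use only when at least k fabrics are configured.

```python
BUS_BITS = 12                        # width of the striper-to-fabric interface
FABRIC_COUNTS = (1, 2, 3, 4, 6, 12)  # legal numbers of fabrics

def active_bits(num_fabrics):
    """Bits handled by each fabric in one configuration (the "active" view)."""
    if num_fabrics not in FABRIC_COUNTS:
        raise ValueError("unsupported fabric count")
    return BUS_BITS // num_fabrics

def logical_bits(slot):
    """Widest stripe a fabric slot sees over every configuration that uses it."""
    if not 1 <= slot <= max(FABRIC_COUNTS):
        raise ValueError("no such fabric slot")
    # Assumption: slot k carries traffic only when at least k fabrics are in use.
    return max(active_bits(n) for n in FABRIC_COUNTS if n >= slot)
```

Under these assumptions, `logical_bits(1)` gives the 12 bit bus drawn for fabric slot #1 and `logical_bits(3)` gives the 4 bit bus drawn for fabric slot #3.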

Unlike previous switches, the switch really doesn't have
a concept of a software controllable fabric redundancy mode. The
fabric ASICs implement N+1 redundancy without any intervention as
long as the spare fabric is installed.

As for what it provides, N+1 redundancy means that
the hardware will automatically detect and correct a single failure
without the loss of any data.

The way the redundancy works is fairly simple, but to
make it even simpler to understand, a specific case of a 120G switch
is used which has 3 fabrics (A, B, and C) plus a spare (S). The
Striper takes the 12 bit bus and first generates a checkword which
gets appended to the data unit (cell or frame). The data unit and
checkword are then split into a 4-bit-per-clock-cycle data stripe
for each of the A, B, and C fabrics (A3A2A1A0, B3B2B1B0, and C3C2C1C0).
These stripes are then used to produce the stripe for the spare
fabric S3S2S1S0 where Sn = An XOR Bn XOR Cn and these 4 stripes are
sent to their corresponding fabrics. On the other side of the
fabrics, the Unstriper receives 4 4-bit stripes from A, B, C, and
S. All possible combinations of 3 fabrics (ABC, ABS, ASC, and SBC)
are then used to reconstruct a "tentative" 12-bit data stream. A
checkword is then calculated for each of the 4 tentative streams
and the calculated checkword compared to the checkword at the end
of the data unit. If no error occurred in transit, then all 4
streams will have checkword matches and the ABC stream will be
forwarded to the Unstriper output. If a (single) error occurred,
only one checkword match will exist and the stream with the match
will be forwarded off chip and the Unstriper will identify the
faulty fabric stripe.
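
The striping and recovery steps above can be sketched for the 3+1 fabric case as follows. This is only an illustration under stated assumptions: the text does not say how the checkword is computed, so a CRC-32 carried as three extra 12-bit words stands in for it, and the assignment of the high, middle and low nibble to fabrics A, B and C is likewise assumed. All function names are hypothetical.

```python
import zlib

def checkword(words):
    """Stand-in checkword (assumption): CRC-32 of the data as three 12-bit words."""
    raw = b"".join(w.to_bytes(2, "big") for w in words)
    crc = zlib.crc32(raw)
    return [(crc >> 24) & 0xFFF, (crc >> 12) & 0xFFF, crc & 0xFFF]

def xor3(p, q, r):
    return [x ^ y ^ z for x, y, z in zip(p, q, r)]

def join(a, b, c):
    """Rebuild 12-bit words from three 4-bit stripes."""
    return [(x << 8) | (y << 4) | z for x, y, z in zip(a, b, c)]

def stripe(words):
    """Striper side: append the checkword, split into A/B/C, derive spare S."""
    unit = list(words) + checkword(words)
    a = [(w >> 8) & 0xF for w in unit]
    b = [(w >> 4) & 0xF for w in unit]
    c = [w & 0xF for w in unit]
    return {"A": a, "B": b, "C": c, "S": xor3(a, b, c)}  # Sn = An XOR Bn XOR Cn

def unstripe(stripes):
    """Unstriper side: return (data words, faulty fabric or None)."""
    a, b, c, s = (stripes[k] for k in "ABCS")
    tentative = {
        "ABC": join(a, b, c),               # spare not used
        "ABS": join(a, b, xor3(a, b, s)),   # C rebuilt from the spare
        "ASC": join(a, xor3(a, s, c), c),   # B rebuilt from the spare
        "SBC": join(xor3(s, b, c), b, c),   # A rebuilt from the spare
    }
    good = [name for name, unit in tentative.items()
            if checkword(unit[:-3]) == unit[-3:]]
    if len(good) == 4:                      # no error in transit
        return tentative["ABC"][:-3], None
    if len(good) == 1:                      # single failure: the unused stripe is bad
        faulty = (set("ABCS") - set(good[0])).pop()
        return tentative[good[0]][:-3], faulty
    raise ValueError("more than one stripe appears to be corrupt")
```

A single flipped bit in one stripe leaves exactly one tentative stream with a matching checkword, and the stripe missing from that combination is reported as faulty. For other configurations only the stripe width changes.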

For different switch configurations, i.e. 1, 2, 4, 6, or
12 fabrics, the algorithm is the same but the stripe width changes.

If 2 fabrics fail, all data running through the switch 
will almost certainly be corrupted. 

There are basically two options, both requiring that the
defective fabrics be known through some means. Unfortunately, in
a double failure system, the hardware that detects and identifies
a failed fabric will only be able to identify the fabric that
failed first (if there was one). Identifying both the failed
fabrics may only be possible through a trial-and-error approach
unless the switch software and/or switch diagnostics can develop
tests to identify the second failure.

The recommended approach would be to shut down the switch
and install as many good fabrics as possible beginning with slot 1.
This makes the maximum bandwidth and redundancy available given
the functional hardware on hand.

The other option is to have the switch software
reconfigure the switch to use fewer fabrics. This is an inferior
solution for two reasons:

1. It can never provide more bandwidth than the
recommended approach.

2. It requires substantial thought and understanding
of the switch by the user in order to determine
what is the maximum operational configuration.

Basically, the user must start at fabric slot 1 and count
the number of operational fabrics. If the spare fabric is
operational, then it may be used to "cover" for the first
non-operational fabric.

Example #1: A redundant 240G switch (6+1 fabrics) has suffered
fabric failures in slots 3 and 4. Starting with slot 1 there are 2
operational fabrics and the spare is available to cover for slot 3.
This switch can be reconfigured to a 120G non-redundant switch or
an 80G redundant switch. Note that by swapping fabrics 5 and 6 into
slots 3 and 4, this switch could be a 160G redundant switch.

Example #2: A redundant 480G switch suffers fabric failures in
slot 1 and the spare. Start swapping fabrics. Slot 1 is dead and
the spare is not available to cover for it. This is the worst case
scenario.

Example #3: A redundant 480G switch suffers fabric failures in
slots 2 and 10. There is one functional fabric counting from slot
1, or 9 if the spare is used to cover for slot 2. This switch can be
configured either as 40G redundant or 240G non-redundant. Note that
fabrics 7, 8, and 9 do not help since the only legal configuration
after 6 fabrics is all 12.
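
The counting rule described above (start at fabric slot 1, count operational fabrics, and let a working spare cover the first non-operational slot) can be sketched as below. The function names are hypothetical, and the sketch assumes 40G per fabric and the legal fabric counts 1, 2, 3, 4, 6 and 12 stated elsewhere in the text.

```python
LEGAL_COUNTS = (1, 2, 3, 4, 6, 12)  # legal numbers of striped fabrics
GBPS_PER_FABRIC = 40

def run_length(installed, failed, covers=0):
    """Count usable slots from slot 1, letting the spare absorb `covers` failures."""
    count = 0
    for slot in range(1, installed + 1):
        if slot in failed:
            if covers == 0:
                break
            covers -= 1
        count += 1
    return count

def largest_legal(count):
    """Largest legal fabric count that does not exceed `count` (0 if none)."""
    return max((n for n in LEGAL_COUNTS if n <= count), default=0)

def reconfigure_options(installed, failed, spare_ok):
    """Best redundant and non-redundant sizes, in Gbps, without moving fabrics."""
    redundant = largest_legal(run_length(installed, failed)) if spare_ok else 0
    covered = run_length(installed, failed, covers=1 if spare_ok else 0)
    return {
        "redundant_G": redundant * GBPS_PER_FABRIC,
        "non_redundant_G": largest_legal(covered) * GBPS_PER_FABRIC,
    }
```

For failures in slots 3 and 4 of a 6+1 switch this gives 80G redundant or 120G non-redundant, and for failures in slots 2 and 10 of a 12+1 switch it gives 40G redundant or 240G non-redundant.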
The fabric slots are numbered and must be populated in
ascending order. Also, the spare fabric is a specific slot so
populating fabric slots 1, 2, 3, and 4 is different than populating
fabric slots 1, 2, 3, and the spare. The former is a 160G switch
without redundancy and the latter is 120G with redundancy.

Firstly, the ASICs are constructed and the backplane
connected such that the use of certain portcard slots requires
at least a certain minimum number of fabrics to be installed,
not including the spare. This relationship is shown in Table 0.

In addition, the APS redundancy within the switch is
limited to specifically paired portcards. Portcards 1 and 2 are
paired, 3 and 4 are paired, and so on through portcards 47 and 48.
This means that if APS redundancy is required, the paired slots
must be populated together.

To give a simple example, take a configuration with 2
portcards and only 1 fabric. If the user does not want to use APS
redundancy, then the 2 portcards can be installed in any two of
portcard slots 1 through 4. If APS redundancy is desired, then the
two portcards must be installed either in slots 1 and 2 or slots 3
and 4.



Portcard Slot    Minimum # of Fabrics
1-4              1
5-8              2
9-12             3
13-16            4
17-24            6
25-48            12

Table 0: Fabric Requirements for Portcard Slot Usage
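
The slot rules of Table 0 and the APS pairing rule lend themselves to a small lookup. The following is a minimal sketch with hypothetical function names; the slot ranges are transcribed from Table 0, and the pairing (1 with 2, 3 with 4, through 47 with 48) follows the text.

```python
# (first slot, last slot, minimum fabrics excluding the spare), from Table 0
TABLE_0 = [(1, 4, 1), (5, 8, 2), (9, 12, 3), (13, 16, 4), (17, 24, 6), (25, 48, 12)]

def min_fabrics(slot):
    """Minimum number of fabrics needed before a portcard slot can be used."""
    for first, last, fabrics in TABLE_0:
        if first <= slot <= last:
            return fabrics
    raise ValueError("portcard slots are numbered 1 through 48")

def aps_partner(slot):
    """The slot paired with `slot` for APS redundancy."""
    return slot + 1 if slot % 2 else slot - 1

def check_population(slots, fabrics, aps=False):
    """List the problems with a proposed set of populated portcard slots."""
    problems = []
    for slot in sorted(slots):
        if min_fabrics(slot) > fabrics:
            problems.append(f"slot {slot} needs {min_fabrics(slot)} fabrics")
        if aps and aps_partner(slot) not in slots:
            problems.append(f"slot {slot} lacks its APS partner {aps_partner(slot)}")
    return problems
```

With one fabric, `check_population({1, 2}, 1, aps=True)` reports no problems, while `check_population({1, 3}, 1, aps=True)` flags both cards as missing their partners.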

To add capacity, add the new fabric(s), wait for the
switch to recognize the change and reconfigure the system to stripe
across the new number of fabrics. Install the new portcards.

Note that it is not technically necessary to have the
full 4 portcards per fabric. The switch will work properly with 3
fabrics installed and a single portcard in slot 12. This isn't cost
efficient but it will work.

To remove capacity, reverse the adding capacity
procedure.

The switch is oversubscribed if, for example, 8 portcards
are installed with only one fabric.

This should only come about as the result of improperly
upgrading the switch or a system failure of some sort. The reality
is that one of two things will occur, depending on how this
situation arises. If the switch is configured as a 40G switch and
the portcards are added before the fabric, then the 5th through 8th
portcards will be dead. If the switch is configured as an 80G
non-redundant switch and the second fabric fails or is removed then
all data through the switch will be corrupted (assuming the spare
fabric is not installed). And just to be complete, if 8 portcards
were installed in an 80G redundant switch and the second fabric
failed or was removed, then the switch would continue to operate
normally with the spare covering for the failed/removed fabric.
The switch includes the following features.

- Scales from 40Gbps to 480Gbps (40, 80, 120, 160, 240, 480
  Gb/sec are the supported configurations).

- Switches ATM cells and variable length packets.

- N+1 fabric redundancy with error detection and recovery
  supported in the ASIC chipset.

- Native APS support.

- Support up to 196K cell shared memory, 9216K unicast and
  64K multicast connections.

- Support 2x port speed for fabric dequeueing (2.5 Gb/sec
  in, 5 Gb/sec out for each OC48 port).

- Supports both OC48c ports and OC192c ports.

- Provides port/priority queuing similar to past switch
  fabrics. Four priorities are provided for 40 - 120 Gb/sec
  switches, 2 priorities/port for 240 Gb/sec switches and
  1 priority for 480 Gb/sec switches.

- ASICs utilize 250 MHz HSTL point to point busses between
  fabric ASICs and interface with the backplane using
  standard Gbit transceivers.

- Interface to port cards chips use 80-125 MHz LVTTL
  signals.

- Support output port supplied back pressure.

The significant architectural difference between the
switch and past switches is that incoming traffic is routed to
multiple switch fabrics. Each fabric is designed to enqueue 40
Gb/sec of data and dequeue 80 Gb/sec of data. As data comes into
the switch, it is broken up on a bit by bit basis and part of each
packet is sent to each fabric in the box. The fabrics will all make
the same enqueuing and drop decisions, and all schedule fragments
of a packet/cell at the same time. Each fabric sends its portion of
the packet or cell to the output port card which reassembles the
fragment into the complete cell/packet which is then passed to a
shared memory ASIC for per port storage and scheduling. The XOR of
the data sent to each fabric is sent to a spare fabric. In the
event of a fabric failure, that fabric's data can be recovered by
utilizing the good data bits and the parity fabric bits to
recalculate any fabric's data. The striping of data to fabrics
happens on the basis of 48 bit chunks. This allows the switch to
support 1, 2, 3, 4, 6 and 12 fabrics.

Five ASICs build the switching functionality for the
switch. These ASICs are described briefly below.



TABLE 1: The switch ASICs

ASIC: Function

Striper: Takes incoming cell from Vortex (or OC192c equivalent) or from POS input stage and breaks the data up
into the appropriate chunks to go to each fabric, calculates the parity for the spare fabric, concatenates a
checksum onto the packet, separates the routeword and data into separate routeword and data busses which
run across the backplane.

Aggregator: Receives separate data and routeword busses from multiple stripers. Converts from the reasonably slim
dedicated striper-to-aggregator busses to a wide shared bus to the memory controllers.

Memory Controllers: Actually perform the queueing of data for the fabrics. Queues the cell into one of 200 queues
(192 UC queues, 4 MC queues and 4 control port queues). All drops which occur in the chipset occur here.

Separator: Combines traffic from multiple memory controllers to one fabric output. Provides rate control of the
stream of data leaving the fabric for each OC48 or OC192c port.

Unstriper: Receives data from multiple separators. Combines traffic and error checks the received data. Detects
errors on any fabric and attempts to reconstruct the good data. Passes the data to the output memory controller.
If the unstriper is on an ATM blade and the data is a packet, it is segmented before passing onto the ATM
controller.



Figure 1 shows packet striping in the switch. 

The chipset supports ATM and POS port cards in both OC48
and OC192c configurations. OC48 port cards interface to the
switching fabrics with four separate OC48 flows. OC192 port cards
logically combine the 4 channels into a 10G stream. The ingress
side of a port card does not perform traffic conversions for
traffic changing between ATM cells and packets. Whichever form of
traffic is received is sent to the switch fabrics. The switch
fabrics will mix packets and cells and then dequeue a mix of
packets and cells to the egress side of a port card.

The egress side of the port is responsible for converting
the traffic to the appropriate format for the output port. This
convention is referred to in the context of the switch as "receiver
makes right". A cell blade is responsible for segmentation of
packets and a packet blade is responsible for reassembly of cells
into packets. To support fabric speed-up, the egress side of the
port card supports a link bandwidth equal to twice the inbound side
of the port card. For each OC48 interface, the unstriper supports
a bandwidth of 6 Gb/sec and for each OC192 interface, a bandwidth of
24 Gb/sec (combined routeword + data).
The block diagram for a Poseidon-based ATM port card is
shown in Figure 2. Each 2.5G channel consists of 4 ASICs: the Vortex
inbound TM ASIC and striper ASIC at the inbound side, and the
unstriper ASIC and Trident outbound TM ASIC at the outbound side.

At the inbound side, the Vortex ASIC aggregates 1 OC-48c
or 4 OC-12c interfaces. Each Vortex sends a 2.5G
cell stream into a dedicated striper ASIC (using the BIB bus, as
described below). The striper converts the Vortex-supplied
routeword into two pieces. A portion of the routeword is passed to
the fabric to determine the output port(s) for the cell. The
entire routeword is also passed on the data portion of the bus as
a routeword for use by the outbound memory controller. The first
routeword is termed the "fabric routeword". The routeword for the
outbound memory controller is the "egress routeword".

At the outbound side, the unstriper ASIC in each channel
takes traffic from each of the port cards, error checks and corrects
the data and then sends correct packets out on its output bus. The
unstriper uses the data from the spare fabric and the checksum
inserted by the striper to detect and correct data corruption. The
5Gbps traffic is then sent to the Trident ASIC of the Poseidon
chipset. The Trident ASIC stores the incoming cells based on per VC
queues and sends them out to OC-12c/OC-48c interfaces at aggregated
speed of 2.5Gbps.

For the POS interfaces, the striper ASIC input bus speeds
up to 3.2Gbps to handle POS overhead. On the outbound side, the
unstriper talks to a reassembly stage which is currently being
defined.
Figure 2 shows an OC48 Port Card.

The OC192 port card supports a single 10G stream to the
fabric and between a 10G and 20G egress stream. This board also
uses 4 stripers and 4 unstripers, but the 4 chips operate in
parallel on a wider data bus. The data sent to each fabric is
identical for both OC48 and OC192 ports so data can flow between
the port types without needing special conversion functions.

Figure 3 shows a 10G concatenated network blade. 

Each 40G switch fabric enqueues up to 40Gbps of cells/frames
and dequeues them at 80Gbps. This 2X speed-up reduces the amount of
traffic buffered at the fabric and lets the outbound ASIC digest
bursts of traffic well above line rate. A switch fabric consists of
three kinds of ASICs: aggregators, memory controllers, and
separators. Nine aggregator ASICs receive 40Gbps of traffic from up
to 48 network blades and the control port. Each aggregator ASIC
combines the fabric routeword and payload into a single data stream,
TDMs between its sources, and places the resulting data on a wide
output bus. An additional control bus (destid) is used to control
how the memory controllers enqueue the data. The data stream from
each aggregator ASIC is then bit sliced into 12 memory controllers.

The memory controller receives up to 16 cells/frames
every 250MHz clock cycle. Each of 12 ASICs stores 1/12 of the
aggregated data streams. It then stores the incoming data based on
control information received on the destid bus. Storage of data is
simplified in the memory controller to be relatively unaware of
packet boundaries (cache line concept). All 12 ASICs dequeue the
stored cells simultaneously at an aggregated speed of 80Gbps.

Nine separator ASICs perform the reverse function of the
aggregator ASICs. Each separator receives data from all 12 memory
controllers and decodes the routewords embedded in the data streams
by the aggregator to find packet boundaries. Each separator ASIC
then sends the data to up to 24 different unstripers depending on
the exact destination indicated by the memory controller as data
was being passed to the separator.

The dequeue process is back-pressure driven. If
back-pressure is applied to the unstriper, that back-pressure is
communicated back to the separator. The separator and memory
controllers also have a back-pressure mechanism which controls when
a memory controller can dequeue traffic to an output port.

In order to support OC48 and OC192 efficiently in the
chipset, the 4 OC48 ports from one port card are always routed to
the same aggregator and from the same separator (the port
connections for the aggregator and separator are always symmetric).
The table below shows the port connections for the aggregator and
separator on each fabric for the switch configurations. Since each
aggregator is accepting traffic from 10G of ports, the addition of
40G of switch capacity only adds ports to 4 aggregators. This leads
to a differing port connection pattern for the first four
aggregators from the second four (and also the corresponding
separators).
TABLE 2: Agg/Sep port connections

Switch Size  Agg 1        Agg 2        Agg 3        Agg 4        Agg 5        Agg 6        Agg 7        Agg 8

40           1,2,3,4      5,6,7,8      9,10,11,12   13,14,15,16

80           1,2,3,4      5,6,7,8      9,10,11,12   13,14,15,16  17,18,19,20  21,22,23,24  25,26,27,28  29,30,31,32

120          1,2,3,4,     5,6,7,8,     9,10,11,12,  13,14,15,16, 17,18,19,20  21,22,23,24  25,26,27,28  29,30,31,32
             33,34,35,36  37,38,39,40  41,42,43,44  45,46,47,48

160          1,2,3,4,     5,6,7,8,     9,10,11,12,  13,14,15,16, 17,18,19,20, 21,22,23,24, 25,26,27,28, 29,30,31,32,
             33,34,35,36  37,38,39,40  41,42,43,44  45,46,47,48  49,50,51,52  53,54,55,56  57,58,59,60  61,62,63,64
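
The connection pattern in Table 2 can be expressed as a short rule: ports are taken in groups of four (one port card), and successive groups cycle through aggregators 1 to 8. The sketch below is inferred only from the four switch sizes shown in the table; the function names are hypothetical and nothing here is confirmed for larger configurations.

```python
PORTS_PER_CARD = 4          # the 4 OC48 ports of a card share one aggregator
PORTS_PER_40G = 16          # each 40G of capacity adds 16 OC48 ports
TABLE_SIZES = (40, 80, 120, 160)

def aggregator_for_port(port):
    """Aggregator (1-8) serving an OC48 port, per the pattern in Table 2."""
    if not 1 <= port <= 64:
        raise ValueError("Table 2 only lists ports 1 through 64")
    group = (port - 1) // PORTS_PER_CARD
    return group % 8 + 1

def ports_for_aggregator(agg, switch_size_g):
    """Ports attached to one aggregator for a switch size listed in Table 2."""
    if switch_size_g not in TABLE_SIZES:
        raise ValueError("switch size not listed in Table 2")
    total_ports = switch_size_g // 40 * PORTS_PER_40G
    return [p for p in range(1, total_ports + 1) if aggregator_for_port(p) == agg]
```

For a 120G switch, `ports_for_aggregator(1, 120)` yields ports 1 to 4 and 33 to 36, which shows why the third 40G of capacity adds ports only to the first four aggregators.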



Figure 4 shows the connectivity of the fabric ASICs. 

The external interfaces of the switches are the Input Bus
(BIB) between the striper ASIC and the ingress blade ASIC such as
Vortex and the Output Bus (BOB) between the unstriper ASIC and the
egress blade ASIC such as Trident.

Two variations of routewords are supported. The first
option uses one 32 bit routeword which is passed to the egress
board as the egress routeword and has fields extracted to form the
fabric routeword. The second option allows the striper to accept
both a fabric routeword (which happens on a dedicated routeword
bus) and an egress routeword (which is received on the data bus).
The second option is more flexible on connection space usage and
expansion since that allows all 32 bits of the routeword to be used
to identify connections on switch egress.

To maintain compatibility with Vortex, bit 24 is still
maintained as the multicast bit. The incoming routeword has the
following format.
TABLE 3: 32 bit BIB/BOB routeword format

bit 30:25    Connection ID (29:28) & Connection ID (19:16)
bit 24       Multicast Bit
bit 23:0     Connection ID (27:20) & Connection ID (15:0)



The 26 bit conn ID in the routeword is set to

MC bit & Connection ID (29:5) for UC connections which are
not special routeword values

MC bit & Connection ID (24:0) for MC connections or for
special routeword unicast values.

For UC connections, although bits 29:5 are passed to
the fabric, only bits 29:20 are used. These bits should be
programmed with the queue to be used. Bits 29:28 should be
programmed with the priority and bits 27:20 programmed with the
queue number.

Note that the RW value used for the outbound memory
controller is set to

'0' & MC bit & connection ID (29:0).

If the fabric is using 10 bits of conn ID, this leaves
20 bits (1 M connections) for use by the outbound memory
controller.

For double routewords, no manipulation is done. The
value passed in on the routeword bus needs to equal the
connection ID to be transmitted on the backplane. The following
two tables show the routeword value which should be passed on the
backplane routeword bus.



TABLE 4: Unicast Connection ID for separate RW bus

Bit ranges legible in the source: bit 24:23, bit 22:15

Fields, in the order given:
Multicast bit = 0
Fabric priority
Fabric queue ID
Future expansion bits. These bits are transmitted to the fabric,
but the current fabric ignores them. Future fabrics may expand to
support these bits.



TABLE 5: Multicast Connection ID for separate RW bus

Bit ranges legible in the source: bit 24:23, bit 22:16, bit 15:0

Fields, in the order given:
Multicast bit = 1
Priority queue ID
Reserved. Note these bits are sent to the fabric to allow future
fabrics to support more connection space.
Multicast connection ID (0 to 64K) used by the fabric.



Special routewords are flagged by using reserved queue
numbers (those in the range of 240-255). These routeword values
indicate the receipt of an OAM cell which must get routed to the
control port or a queue resynch operation. These special values
are always expressed in terms of the connection ID which goes to
the fabric. If special routewords are given to the fabric, the
memory controller routeword must also be modified if these are
getting passed in using the separate connection number bus.

The routeword passed to the fabric will contain the
multicast bit and the port mask bits (bits 23:16). The routeword
passed to the outbound memory controller will maintain the port
mask and also contain the vortex ID and the port ID.

The connection ID of an OAM cell has a special format
generated by the Vortex ASIC:



TABLE 6: Connection ID for OAM cell

Bit ranges legible in the source: bit 24:23, bit 22:15, bit 14:9, bit 7:0

Fields, in the order given:
Multicast bit = 0
Vortex ID (7:6)
0xF0 (hex)
Vortex ID (5:0)
reserved
Port ID



The Vortex ID field is used to indicate which source
Vortex ASIC the cell comes from. The port ID indicates which port
the cell comes from inside the Vortex ASIC. Note that OAM cells are
all unicast. All OAM cells are destined to one of 196 blade and
control port queues programmed by an 8 bit OAM cell destination
register in the memory controller ASICs. If separate routeword
busses are being used, bit 24:16 of the BIB_CONN field will be
passed to the fabric. The routeword which appears on the data bus
(memory controller routeword) should include the port mask, vortex
ID and port ID fields in bits 23:0. The value in the multicast bit
is a don't care for the memory controller routeword.

Fabric queue ID 0xF0-0xF7 of the unicast connection ID is
reserved for software use. All packets which have the fabric queue
ID in range of 0xF0-0xFF will be redirected to one of the 4 control
port queues based on a programmable register.

The connection ID of a resync cell has the following
format. The resync cell is used to resynchronize queues in the
memory controller ASICs. Fabric queue ID 0xF0-0xFF of the unicast
connection ID is reserved for special fabric functions.
TABLE 7: Connection ID for Resync cell

Bit ranges legible in the source: bit 22:15, bit 14:13, bit 12:0

Fields, in the order given:
Multicast bit = 0
Priority (unused)
0xFF (hex)
Number of priorities per port
Reserved



The number of priority queues per port can only be
changed during the queue resync period, i.e., when a fabric is
removed or inserted as follows:

00: one priority per port for 480G switch, pick bit 15
down to 8 of the connection ID as the queue ID;

01: two priorities per port for 240G switch, pick bit 16
down to 9 of the connection ID as the queue ID;

10: 4 priorities per port for 120G or smaller switch,
pick bit 17 down to 10 of the connection ID as the queue ID;

11: reserved
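
The queue ID selection above amounts to choosing an 8 bit window of the connection ID according to the priority mode. A minimal sketch follows, reading the three windows as bits 15:8, 16:9 and 17:10 and the mode codes as 00, 01 and 10 with 11 reserved; the function name and the integer encoding of the mode are assumptions made for illustration.

```python
# Priority mode -> lowest bit of the 8 bit queue ID window in the connection ID
QUEUE_ID_LOW_BIT = {
    0b00: 8,   # one priority per port: bits 15 down to 8
    0b01: 9,   # two priorities per port: bits 16 down to 9
    0b10: 10,  # four priorities per port: bits 17 down to 10
}

def queue_id(connection_id, priority_mode):
    """Extract the 8 bit queue ID for the given priority mode."""
    if priority_mode not in QUEUE_ID_LOW_BIT:
        raise ValueError("priority mode 11 is reserved")
    return (connection_id >> QUEUE_ID_LOW_BIT[priority_mode]) & 0xFF
```

Each extra priority bit per port shifts the window up by one position, so the same 8 bit queue field is read from a different place in the connection ID as the switch size changes.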



The resync cell can also be used to copy the shadow data
register to a valid location where the shadow address register
points to.

Shadow control cell is used to copy the shadow data
register to a valid location where the shadow address register
points to. The connection ID of a shadow control cell is as follows.



TABLE 8: Connection ID for Shadow Control Cell

Bit ranges legible in the source: bit 24, bit 22:15, bit 14:0

Fields, in the order given:
Multicast bit = 0
0xFC (hex)
Data coming into the BIB bus and out of the BOB bus is
assumed to be filled onto the busses from most significant bit to
least significant bit (highest number bit to lowest number bit).

The Striper ASIC accepts data from the ingress port via
the Input Bus (BIB) (also known as DIN_ST_bl_ch bus).

This bus can either operate as 4 separate 32 bit input
buses (4xOC48c) or a single 128 bit wide data bus with a common set
of control lines to all stripers. This bus supports either cells or
packets based on software configuration of the striper chip. It
consists of the following signals:

- BIB_Clock: This clock is sourced by the Striper ASIC at
up to 100 MHz and is used as a reference for data and
control signals on the BIB.

- BIB_BP: This signal is asserted (low) to indicate the
striper ASIC cannot take data on the bus due to a
bandwidth difference between the BIB and [illegible] busses.
Interfaces which run below [illegible] MHz will never see this
signal asserted. At 100 MHz, this signal is asserted if
more than 65536 bytes of back-to-back data are given.
This signal should be sampled at the start of packet.
During a packet transfer, this signal will be asserted if
the FIFO conditions would cause BP if the packet ended on
the current clock cycle. If BP is asserted the clock
cycle after the EOP, the striper will effectively ignore
the input bus until the BP indication is withdrawn. The
packet ingress stage should repeat the first word of the
next packet transfer and then proceed with the rest of
the packet after the BP signal goes away.

- BIB_Valid_L: This active low input signal delimits valid
data on the BIB_SOP, BIB_EOP, and BIB_DATA busses. If
this signal is active, the busses are assumed to be
valid. If high, the busses are treated as having invalid
data for the current clock cycle. If a transfer is not in
progress (no SOP without EOP has been given) then the
data bus is treated as invalid even if this signal is a
one. For cell interfaces, this signal can be tied active.

- BIB_Cell_Pkt: This signal is set to a one to indicate a
cell transfer and a zero to indicate a packet transfer.
Signal needs to be valid the same clock cycle as start of
cell.

- BIB_Data [127:0]: This is the input 128 bit data bus. If
running in 32 bit mode, a cell consists of a 4 byte RW,
a 4 byte Header, and twelve 4 byte data words. A packet
has a RW and N data words, where 1 < N. If running in 128
bit mode, a cell has a 4 byte RW, a 4 byte header, and 8
bytes of data in the first word, 2 words with 16 bytes of
data, and a final word with 8 bytes of data, if the data
starts on a word boundary. A following cell can start on
the half-word boundary and have all fields offset by 8
bytes. Packets in 128 bit mode work in the same fashion
as 32 bit mode, except that EOP and SOP can have larger
values. Minimum packet length supported is 16 bytes. If
half word boundary cell starts are used, the correct
value (0/4) needs to be given on the SOP bits 3:0.

- BIB_EOP [4:0]: This bus has two fields. Bit 4 is a one to
indicate an EOP on the current transfer (if BIB_Valid_L
is active). Bit 4 is a zero to indicate no EOP on the
current transfer. Bits 3:0 give the offset of the last
byte which is valid. The EOP field is not utilized for
cell transfers.

- BIB_SOP/C [1:0]: This bit indicates a start of packet or
cell on the current bus cycle (if BIB_Valid_L is active).
A value of zero indicates start of transfer, a value of
one indicates no start of transfer. Asserting bit 1
indicates that the upper 64 bits carries the SOP and
asserting bit 0 indicates that the lower 64 bits
carries the SOP (for 128 bit bus only). For the 32 bit
bus, SOP(0) should be used, SOP(1) should be tied high.
For the 128 bit bus, if a packet ends in the upper 64
bits of the bus, a new packet can begin at bit 64.

- BIB_CONN (24:0): This is an optional bus. It can be used
to pass a routeword to the striper ASIC to use as the
fabric routeword, or the routeword can be transferred as
the most significant 32 bits of the first word of data.
The data should be valid the same cycle as SOP/C. The
value during non-SOP/C cycles is a don't care. The
interface is statically configured to either use the
separate connection number bus or to expect the routeword
on the data bus.
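
As a small illustration of the 5 bit end-of-packet field in the signal list above (bit 4 as the end-of-packet flag, bits 3:0 as the offset of the last valid byte), the field can be decoded as sketched below. The function name is hypothetical and the sketch covers only the field layout, not bus timing or the valid signal.

```python
def decode_eop(eop):
    """Split the 5 bit end-of-packet field into (is_end_of_packet, last_byte_offset)."""
    if not 0 <= eop <= 0x1F:
        raise ValueError("the end-of-packet field is 5 bits wide")
    is_end = bool(eop & 0x10)   # bit 4: end of packet on this transfer
    last_byte = eop & 0x0F      # bits 3:0: offset of the last valid byte
    return is_end, last_byte
```

For example, `decode_eop(0b10111)` marks the end of a packet whose last valid byte is at offset 7. The field is not used for cell transfers, so a caller would apply this only to packet transfers.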
Figure 5 shows a 32 bit BIB cell transfer.

Figure 6 shows a BIB back-pressure.

Figure 7 shows a 32 bit BIB packet transfer using
external connection number bus.

The unstriper ASIC sends data to the egress port via
Output Bus (BOB) (also known as DOUT_UN_bl_ch bus), which is a 64
(or 256) bit data bus that can support either cell or packet.

This bus can either operate as 4 separate 32 bit output
buses (4xOC48c) or a single 128 bit wide data bus with a common set
of control lines from all Unstripers. This bus supports either
cells or packets based on software configuration of the unstriper
chip. It consists of the following signals:

DOD_Clock: This clock is sourced from the unstriper ASIC
at up to 100 MHz and is used as a reference for data and
control signals on the DOD.

DOD_BP: This active low input signal indicates whether
data can be transferred (inactive) or cannot be
transferred (active). When back-pressure is asserted,
the unstriper will stop advancing the output bus and
signal data is not valid using the DOD_Valid signal.
Since synchronization must be done on both sides of the
interfaces, 8 clock cycles of data must be allowed from
the assertion of BP to data stopping. The source driving
DOD_BP cannot make any assumptions on the data stopping
or restarting except by examining DOD_Valid.

DOD_Valid_L: This active low output signal indicates
whether the bus has valid data or not during a transfer.
This signal indicates invalid data only when DOD_BP has
been asserted.

DOD_Data: This is the output bit data bus. It can either
be 64 bits wide or 256 bits wide. If running in 64 bit
mode, a cell consists of a word with a 4 byte RW and a 4
byte Header followed by 6 data words. A packet has a RW
and N data words, where 1 <= N. If running in 256 bit mode
and a cell starts on an even 32 byte word boundary, a
cell has a word with a 4 byte RW, a 4 byte header and 24
bytes of data in the first word, and a second word with
24 bytes of data. A following cell can start on the next
used byte and have all fields offset by 8 bytes. Valid
cell start locations are all multiples of 8 (0, 8, 16,
24). Packets in 128 bit mode work in the same fashion as
32 bit mode, except that EOP and SOP can have larger
values. Minimum packet length supported is 16 bytes. If
half-word boundary cell starts are used, the correct
value (0/4) needs to be given on the EOP bits 3:0.

DOD_EOP: This bit is asserted when the last transfer of
a packet is occurring.

DOD_Cell_Pkt: This signal is set to a one to indicate a
cell transfer and a zero to indicate a packet transfer.
Signal needs to be valid the same clock cycle as start of
cell.

DOD_SOP/C: This bit is a zero to indicate a start of
packet or cell on the current bus cycle. Data is always
assumed to start at the most significant bit of the bus.

Figure 8 shows a 64 bit DOD cell transfer.

Figure 9 shows a 64 bit DOD packet transfer.

Figure 10 shows an overview of the datapath of the switch
ASICs.

The data on the data bus transports an optional byte
count (32 bit word, lower 16 bits are the byte count) and a 32 bit
egress routeword. The unstriper core will always produce a byte
count. If a segmentation engine is used to break the packet up
into cells, then the segmentation engine will drop the byte count
word before it is given to the cell interface. This dropping is
only supported in OC48 mode. In OC192 mode, the chipset will have
no provisions for segmentation and dropping the byte count word.

TABLE 9: OC48 DOD format

OC48 bits  OC192 bits  Label       Usage
63:48      255:240     Unused      Reserved for unstriper use
47:32      239:224     Byte count  Gives the count of the number of bytes
                                   in the packet, not counting the 4 bytes
                                   for the egress routeword and the bytes
                                   for the byte count (basically, this
                                   corresponds to the byte count of the
                                   received packet plus/minus any changes
                                   for reencapsulation, pushes, or pops).
           223:192     Egress RW   Routeword for the egress memory
                                   controller
                                   Next bits start the data (bits 191 to 0
                                   for 192, next clock cycle for OC48)
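
The OC48 layout of the first word can be sketched as a pack/unpack
pair. This is only an illustration: the function names are
hypothetical, and placing the egress routeword in bits 31:0 is an
assumption drawn from the description of a 32 bit byte count word
followed by a 32 bit egress routeword, since the table row itself
gives no OC48 bit range for the routeword.

```python
def pack_oc48_word(byte_count: int, egress_rw: int) -> int:
    """Build the 64 bit word: 63:48 unused, 47:32 byte count,
    31:0 egress routeword (the 31:0 placement is an assumption)."""
    if not 0 <= byte_count < (1 << 16):
        raise ValueError("byte count must fit in 16 bits")
    if not 0 <= egress_rw < (1 << 32):
        raise ValueError("routeword must fit in 32 bits")
    return (byte_count << 32) | egress_rw


def unpack_oc48_word(word: int):
    """Return (byte_count, egress_rw) from a 64 bit word."""
    return (word >> 32) & 0xFFFF, word & 0xFFFFFFFF


word = pack_oc48_word(64, 0xDEADBEEF)
fields = unpack_oc48_word(word)
```

A segmentation engine that drops the byte count would simply discard
the upper 32 bits of this word before handing data to the cell
interface.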

The Synchronizer has two main purposes. The first
purpose is to maintain logical cell/packet or datagram ordering
across all fabrics. On the fabric ingress interface, datagrams
arriving at more than one fabric from one port card's channels
need to be processed in the same order across all fabrics. The
Synchronizer's second purpose is to have a port card's egress
channel re-assemble all segments or stripes of a datagram that
belong together even though the datagram segments are being sent
from more than one fabric and can arrive at the blade's egress
inputs at different times. This mechanism needs to be maintained in
a system that will have different net delays and varying amounts of
clock drift between blades and fabrics.

The switch uses a system of synchronized windows where
start information is transmitted around the system. Each transmitter
and receiver can look at relative clock counts from the last
resynch indication to synchronize data from multiple sources. The
receiver will delay the receipt of data which is the first clock
cycle of data in a synch period until a programmable delay after it
receives the global synch indication. At this point, all data is
considered to have been received simultaneously and fixed ordering
is applied. Even though the delays for packet 0 and cell 0 caused
them to be seen at the receivers in different orders due to delays
through the box, the resulting ordering of both streams at receive
time = 1 is the same, Packet 0, Cell 0, based on the physical bus
from which they were received.

Multiple cells or packets can be sent in one counter
tick. All destinations will order all cells from the first
interface before moving on to the second interface and so on. This
cell synchronization technique is used on all cell interfaces.
Differing resolutions are required on some interfaces.
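
The fixed ordering rule described above can be sketched as a sort
key. This is a minimal illustration, not the chipset logic; the
Arrival record, its field names and order_window are hypothetical.
Data falling in the same synch window is treated as simultaneous and
is then ordered by the physical interface it arrived on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arrival:
    window: int      # counter tick (synch window) the data belongs to
    interface: int   # index of the physical bus it was received on
    seq: int         # arrival order on that bus within the window
    name: str


def order_window(arrivals):
    """Order by window, then physical interface, then per-bus sequence."""
    return sorted(arrivals, key=lambda a: (a.window, a.interface, a.seq))


# Two receivers see the same data in different wall-clock orders.
rx_a = [Arrival(1, 1, 0, "Cell 0"), Arrival(1, 0, 0, "Packet 0")]
rx_b = [Arrival(1, 0, 0, "Packet 0"), Arrival(1, 1, 0, "Cell 0")]
names_a = [a.name for a in order_window(rx_a)]
names_b = [a.name for a in order_window(rx_b)]
```

Both receivers end with the same sequence because the key ignores the
actual arrival time inside the window.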

The Synchronizer consists of two main blocks, namely, the
transmitter and receiver. The transmitter block will reside in the
Striper and Separator ASICs and the receiver block will reside in
the Aggregator and Unstriper ASICs. The receiver in the Aggregator
will handle up to 24 (6 port cards x 4 channels) input lanes. The
receiver in the Unstriper will handle up to 13 (12 fabrics + 1
parity fabric) input lanes.

When a sync pulse is received, the transmitter first
calculates the number of clock cycles it is fast (denoted as N
clocks).

The transmit synchronizer will interrupt the output
stream and transmit N K characters indicating it is locking down.
At the end of the lockdown sequence, the transmitter transmits a K
character indicating that valid data will start on the next clock
cycle. This next cycle valid indication is used by the receivers
to synchronize traffic from all sources. Refer to "K character
usage" on page 34 for the mapping of K characters to the functions.

At the next end of transfer, the transmitter will then
insert at least one idle on the interface. These idles allow the
10 bit decoders to correctly resynchronize to the 10 bit serial
code window if they fall out of synch.

The receive synchronizer receives the global synch pulse
and delays the synch pulse by a programmed number (which is
programmed based on the maximum amount of transport delay a
physical box can have). After delaying the synch pulse, the
receiver will then consider the clock cycle immediately after the
synch character to be eligible to be received. Data is then
received every clock cycle until the next synch character is seen
on the input stream. This data is not considered to be eligible
for receipt until the delayed global synch pulse is seen.

Since transmitters and receivers will be on different
physical boards and clocked by different oscillators, clock speed
differences will exist between them. To bound the number of clock
cycles between different transmitters and receivers, a global sync
pulse is used at the system level to resynchronize all sequence
counters. Each chip is programmed to ensure that under all valid
clock skews, each transmitter and receiver will think that it is
fast by at least one clock cycle. Each chip then waits for the
number of clock cycles it is into its current
sync_pulse_window. This ensures that all sources run
N * sync_pulse_window valid clock cycles between synch pulses.

As an example, the synch pulse window could be programmed
to 100 clocks, and the synch pulses sent out at a nominal rate of
a synch pulse every 10,000 clocks. Based on worst case drifts
for both the synch pulse transmitter clocks and the synch pulse
receiver clocks, there may actually be 9,995 to 10,005 clocks at
the receiver for 10,000 clocks on the synch pulse transmitter. In
this case, the synch pulse transmitter would be programmed to send
out synch pulses every 10,006 clock cycles. The 10,006 clocks
guarantee that all receivers must be in their next window. A
receiver with a fast clock may have actually seen 10,012 clocks if
the synch pulse transmitter has a slow clock. Since the synch
pulse was received 12 clock cycles into the synch pulse window, the
chip would delay for 12 clock cycles. Another receiver could have seen
10,006 clocks and lock down for 6 clock cycles at the end of the
synch pulse window. In both cases, each source ran 10,100 clock
cycles.
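
The lockdown arithmetic in this example can be sketched as a
remainder calculation. The function name and the modulo formulation
are assumptions made for illustration; the text only states that a
chip delays by the number of clocks it is into its current synch
pulse window.

```python
def lockdown_cycles(clocks_seen: int, window: int) -> int:
    """Clocks into the current synch pulse window when the pulse arrives."""
    return clocks_seen % window


WINDOW = 100                                   # programmed window, in clocks
fast_rx = lockdown_cycles(10_012, WINDOW)      # receiver with a fast clock
other_rx = lockdown_cycles(10_006, WINDOW)     # second receiver in the example

# After locking down, each receiver sits on a whole number of windows.
aligned = [(10_012 - fast_rx) % WINDOW, (10_006 - other_rx) % WINDOW]
```

Under this reading the two receivers lock down for 12 and 6 cycles
respectively, matching the figures quoted in the example.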

When a port card or fabric is not present or has just
been inserted and either of them is supposed to be driving the
inputs of a receive synchronizer, the writing of data to the
particular input FIFO will be inhibited since the input clock will
be absent or unstable and the status of the data lines will be
unknown. When the port card or fabric is inserted, software must
come in and enable the input to the byte lane to allow data from
that source to be enabled. Writes to the input FIFO will be
enabled. It is assumed that the enable signal will be asserted
after the data, routeword and clock from the port card or fabric
are stable.

At a system level, there will be a primary and secondary
sync pulse transmitter residing on two separate fabrics. There
will also be a sync pulse receiver on each fabric and blade. This
can be seen in Figure [[11]] 5. The primary sync pulse transmitter
will be a free-running sync pulse generator and the secondary sync
pulse transmitter will synchronize its sync pulse to the primary.
The sync pulse receivers will receive both primary and secondary
sync pulses and, based on an error checking algorithm, will select
the correct sync pulse to forward on to the ASICs residing on that
board. The sync pulse receiver will guarantee that a sync pulse is
only forwarded to the rest of the board if the sync pulse from the
sync pulse transmitters falls within its own sequence "0" count.
For example, the sync pulse receiver and an Unstriper ASIC will
both reside on the same Blade. The sync pulse receiver and the
receive synchronizer in the Unstriper will be clocked from the same
crystal oscillator, so no clock drift should be present between the
clocks used to increment the internal sequence counters. The
receive synchronizer will require that the sync pulse it receives
will always reside in the "0" count window.

If the sync pulse receiver determines that the primary
sync pulse transmitter is out of sync, it will switch over to the
secondary sync pulse transmitter source. The secondary sync pulse
transmitter will also determine that the primary sync pulse
transmitter is out of sync and will start generating its own sync
pulse independently of the primary sync pulse transmitter. This is
the secondary sync pulse transmitter's primary mode of operation.
If the sync pulse receiver determines that the primary sync pulse
transmitter is in sync once again, it will switch to the
primary side. The secondary sync pulse transmitter will also
determine that the primary sync pulse transmitter is in
sync once again and will switch back to a secondary mode. In the
secondary mode, it will sync up its own sync pulse to the primary
sync pulse. The sync pulse receiver will have less tolerance in
its sync pulse filtering mechanism than the secondary sync pulse
transmitter. The sync pulse receiver will switch over more quickly
than the secondary sync pulse transmitter. This is done to ensure
that all receiver synchronizers will have switched over to using
the secondary sync pulse transmitter source before the secondary
sync pulse transmitter switches over to a primary mode.
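
The effect of the two tolerance levels can be sketched with a simple
consecutive-miss filter. The filter itself, the function name and the
tolerance values 2 and 4 are purely hypothetical; the text does not
describe the filtering mechanism, only that the receiver is less
tolerant than the secondary transmitter.

```python
def first_trip(primary_ok, tolerance):
    """Index of the sync period at which `tolerance` consecutive bad
    primary pulses have been seen, or None if that never happens."""
    bad = 0
    for i, ok in enumerate(primary_ok):
        bad = 0 if ok else bad + 1
        if bad >= tolerance:
            return i
    return None


# True means the primary pulse landed where expected in that period.
history = [True, True, False, False, False, False, True]

rx_switch = first_trip(history, tolerance=2)  # sync pulse receiver
tx_switch = first_trip(history, tolerance=4)  # secondary transmitter
```

With a smaller tolerance the receivers have already moved to the
secondary source by the time the secondary transmitter starts
free-running.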

Figure [[11]] 5 shows sync pulse distribution.

In order to lock down the backplane transmission from a
fabric by the number of clock cycles indicated in the sync
calculation, the entire fabric must effectively freeze for that many
clock cycles to ensure that the same enqueuing and dequeuing
decisions stay in sync. This requires support in each of the
fabric ASICs. Lockdown stops all functionality, including special
functions like queue resynch.

The sync signal from the synch pulse receiver is
distributed to all ASICs. Each fabric ASIC contains a counter in
the core clock domain that counts clock cycles between global sync
pulses. After the sync pulse is received, each ASIC calculates the
number of clock cycles it is fast. (5). Because the global sync is
not transferred with its own clock, the calculated lockdown cycle
value may not be the same for all ASICs on the same fabric. This
difference is accounted for by keeping all interface FIFOs at a
depth where they can tolerate the maximum skew of lockdown counts.

Lockdown cycles on all chips are always inserted at the
same logical point relative to the beginning of the last sequence
of "useful" (non-lockdown) cycles. That is, every chip will always
execute the same number of "useful" cycles between lockdown events,
even though the number of lockdown cycles varies.

Lockdown may occur at different times on different chips.
All fabric input FIFOs are initially set up such that lockdown can
occur on either side of the FIFO first without the FIFO running dry
or overflowing. On each chip-chip interface, there is a sync FIFO
to account for lockdown cycles (as well as board trace lengths and
clock skews). The transmitter signals lockdown while it is locked
down. The receiver does not push during indicated cycles, and does
not pop during its own lockdown. The FIFO depth will vary
depending on which chip locks down first, but the variation is
bounded by the maximum number of lockdown cycles. The number of
lockdown cycles a particular chip sees during one global sync
period may vary, but they will all have the same number of useful
cycles. The total number of lockdown cycles each chip on a
particular fabric sees will be the same, within a bounded
tolerance.
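
The bounded variation of the sync FIFO depth can be sketched with a
cycle-by-cycle model. The function, the cycle numbers and the initial
depth of 4 are hypothetical; the model assumes one word pushed per
non-lockdown transmitter cycle and one word popped per non-lockdown
receiver cycle.

```python
def simulate_depth(initial_depth, tx_lock, rx_lock, cycles):
    """Track sync FIFO depth when either side may be locked down."""
    depth = initial_depth
    trace = []
    for c in range(cycles):
        if c not in tx_lock:
            depth += 1   # transmitter sent a useful word, receiver pushes it
        if c not in rx_lock:
            depth -= 1   # receiver core is running, so it pops one word
        trace.append(depth)
    return trace


# Transmitter locks down first (cycles 10-12), receiver later (20-22).
tx_first = simulate_depth(4, {10, 11, 12}, {20, 21, 22}, 30)
# Receiver locks down first, transmitter later.
rx_first = simulate_depth(4, {20, 21, 22}, {10, 11, 12}, 30)
```

In both runs the depth moves by at most the three lockdown cycles and
returns to its starting value once both sides have locked down.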

The Aggregator core clock domain completely stops for the
lockdown duration; all flops and memory hold their state. Input
FIFOs are allowed to build up. Lockdown bus cycles are inserted in
the output queues. Exactly when the core lockdown is executed is
dictated by when DOUT_AG bus protocol allows lockdown cycles to be
inserted. DOUT_AG lockdown cycles are indicated on the DestID bus.

The memory controller must lock down all flops for the
appropriate number of cycles. To reduce impact to the silicon area
in the memory controller, a technique called propagated lockdown is
used.

The aggregator signals lockdown cycles on the DIN_ME bus.
The memory controller does not push during these cycles. The
memory controller does not pop during lockdown to account for the
non-push cycles. The FIFO depth is set during fabric
synchronization to tolerate getting deeper or shallower depending
on who locks down first.

Lockdown idle cycles are inserted on the DOUT and CH_ID
busses. An extended sync signal is used to indicate the number of
lockdown cycles on the DOUT_ME bus to aid the Separator's lockdown
function.

The token bus lockdown looks the same as the DIN_ME bus
from a memory controller perspective. Non-push cycles are signaled
by the separators according to their lockdowns. The memory
controller does not pop during lockdown. The Separator locks down
completely in a manner similar to the Aggregator. DIN_SP and CH_ID
lockdown cycles are signaled individually per bus via the SYNC
signals. Any continuous SYNC assertion after the first one is
considered a lockdown cycle. Lockdown bus cycles are not pushed
into the input FIFOs.

The chip-to-chip communication within a single fabric
must be synchronized. Although no clock drift exists between
chips, differences in track delays cause data to arrive at
different Memory Controllers at different times. All Memory
Controllers need to process incoming packets in exactly the same
logical order on each chip. The Separators must align and combine
multiple data slices coming from different Memory Controllers. The
Memory Controllers must take the tokens received from the
Separators and apply them at exactly the same point in the logical
packet flow, or drop decisions may differ from chip to chip.

The on-fabric chip-to-chip synchronization is executed at
every sync pulse. While some sync error detecting capability may
exist in some of the ASICs, it is the Unstriper's job to detect
fabric synchronization errors and to remove the offending fabric.
The chip-to-chip synchronization is a cascaded function that is
done before any packet flow is enabled on the fabric. The
synchronization flows from the Aggregator to the Memory Controller,
to the Separator, and back to the Memory Controller. After the
system reset, the Aggregators wait for the first global sync
signal. When received, each Aggregator transmits a local sync
command (value 0x2) on the DestID bus to each Memory Controller.

The Memory Controllers do not push anything into a DIN
input FIFO until the first sync command is seen on that bus. The
sync and every bus cycle following is constantly pushed into the
input FIFO. On the core side of the input FIFOs, no FIFO is popped
until a sync appears in the FIFO from every Aggregator. After two
additional margin cycles, every input FIFO is popped every cycle.
After this point the input FIFO depths remain constant. The depths
are roughly a function of the track delays from each Aggregator.
Immediately after the Memory Controllers begin sampling the
Aggregator input FIFOs, a sync signal (S_SYNC_L) is transmitted to
all Separators on the DOUT and CH_ID busses.
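
The input FIFO alignment described above can be sketched as a small
simulation. The function name and the arrival cycles are
hypothetical, and the exact cycle on which popping begins is an
assumption (the cycle of the last sync arrival plus the two margin
cycles).

```python
def simulate_alignment(sync_arrival, margin=2, cycles=20):
    """sync_arrival[i] is the cycle on which the sync command from
    Aggregator i is first pushed into its input FIFO."""
    start_pop = max(sync_arrival) + margin
    depths = [0] * len(sync_arrival)
    history = []
    for c in range(cycles):
        for i, t in enumerate(sync_arrival):
            if c >= t:
                depths[i] += 1   # sync and every later bus cycle is pushed
            if c >= start_pop:
                depths[i] -= 1   # all FIFOs are popped together each cycle
        history.append(tuple(depths))
    return start_pop, history


start_pop, history = simulate_alignment([3, 5, 4])
steady = history[-1]
```

Once popping starts, each FIFO holds a fixed number of entries that
depends only on how early its sync arrived, which mirrors the
statement that the depths track the delays from each Aggregator.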

Like the Memory Controllers, the Separators do not push
into the DIN and CH_ID busses until a sync signal is received on
that bus. The sync and everything after is constantly pushed into
the input FIFO.

On the core side the Separator always waits until at
least one word is present on all input busses, and then pops the
CH_ID and DIN busses simultaneously. This will logically align the
data stripes coming from the Memory Controllers. After the first
combined sync is popped from the input FIFOs, the Separators send
a sync signal on the TOKEN bus to the Memory Controllers.

The Memory Controllers do not push into the TOKEN bus
input FIFO until a sync signal (0x3F on the token bus) has been
seen on the bus. The sync and all subsequent tokens and idles are
always pushed.

All Memory Controllers need to apply the received tokens
to the same point in the incoming logical flow in order for
drop decisions to be identical. This is done by waiting a worst
case number of clock cycles after the Separator sync transmission
before beginning to pop the token input FIFO. The worst case delay
must be used because there is no way for a single Memory Controller
to know exactly when all other Memory Controllers have received a
token. The programmable delay stored in the 16 bit Token Sync Wait
Register is in "useful" cycles (125 MHz) that do not include the
fabric lockdown cycles. The worst case delay is the worst case
skew for all data paths going from the Aggregator to Memory
Controller to Separator and back to Memory Controller.



The following Table 10 gives the min/max delays which the
chipset supports and represent the limits of what is verified in
the chip verification process.

Sync pulse transport delay from Transmitter to any

individual chip receiving the syne pulse — (WC path DC path) : — 5-6-9- 

rrS — (min delay of — — max delay of 000 nS) . &t — 17 5 pa /inch, — this 

works out to a difference of about 70m. Dackplane transport delay 

5 difference from local sync pulse receipt to reception of the sync 

indication — flag by the — far — end chips : — 5-fr6 — rrS-: Note — that — rt — rs- 

desired — to — allot — about — 2-5 — rrS — of — this — ■bo — the — chip — synchronizer 
operation which gives a delta path delay supported of 500 nG . 

Oscillators should — be HH3 ppm — oscillators . The 

10 assumption of the design was — that the difference in transmission 
path delay was less than or equal to clock drift. — On board delays 
between chips have been designed to exceed the following specs: 

Shortest net : — 0 . 2 5 {f , — transport delay of pretty much 0. 
Longest net: — 25" , — transport delay is — 5 — nO . 

For any signal distribution. — The net delta delay between 

chips — irs — a multiplier — o-f — t-fcre — number — of busses — the — sync — hers — tra - 
versed. Since the sync goes through a receive synchronization to 

^fre — local — clock of — t-he — chip, — an — H 9 — nO uncertainly has — to be 

added at each stage giving a net uncertainty of around 21 nG for 
each hop. 

TABLE 10: Fabric sync delay

Chip                 Number of  Skew    Notes
                     busses
Aggregator           1          21 nS   Sync pulse in
Memory controller    2          42 nS   Sync pulse to agg + agg_me delta
Sep DIN                                 Sync pulse to agg + agg_me + me_sep
                                        (note this sync pulse is delayed by
                                        the memory controller for propagated
                                        lockdown).
Memory controller    4          84 nS   Everything above + sep_me tokens.
token in

The control port follows the same cell flow as the
regular ports. The switch control processor sends cells to the
striper ASIC; the striper stripes the cells and route words across
all fabrics. An additional aggregator (9th) ASIC sends cells via
the DOUT_AG/DestID buses to all 12 memory controllers. Each memory
controller ASIC has an additional 9th DIN_ME_fb_se_9 bus.

The memory controller ASIC will route the incoming
control port cells to any one of the control port destination
queues and blade queues (up to 190 queues). The 9th DOUT_ME_fb_se_9
bus is used to send the control cells to the 9th separator ASIC,
which sends the cells to one of several destination unstriper
ASICs. The unstriper ASIC reconstructs the cells from the 9th
separator ASICs across all fabrics. It sends the complete control
cells to the switch control processor it is connected to.

Note that the control port destination queues can be part
of any multicast cells such that the multicast port mask is
necessary to include additional bit(s) to indicate the control port
queue(s).

There are at most 4 control ports in any switch
configuration. This limitation is because the aggregator and
separator ASICs only have 4 12-bit channels which can be scalable
to different switch configurations, respectively. In other words,
bus DIN_AG_fb_9_1_1, DIN_AG_fb_9_2_1, DIN_AG_fb_9_3_1, and
DIN_AG_fb_9_4_1 of the aggregator ASIC are connected to up to 4
control port striper ASICs. Bus DOUT_SP_fb_9_1_1, DOUT_SP_fb_9_2_1,
DOUT_SP_fb_9_3_1, and DOUT_SP_fb_9_4_1 of the separator ASIC are
connected to up to 4 control port unstriper ASICs.

The striping function assigns bits from incoming data
streams to individual fabrics. Two items were optimized in deriving
the striping assignment:

1. Backplane efficiency should be optimized for OC48
and OC192.

2. Backplane interconnection should not be
significantly altered for OC192 operation.

These were traded off against additional muxing legs for
the striper and unstriper ASICs. Regardless of the optimization,
the switch must have the same data format in the memory controller
for both OC48 and OC192.



Backplane efficiency requires that minimal padding be
added when forming the backplane busses. Given the 12 bit backplane
bus for OC48 and the 48 bit backplane bus for OC192, an optimal
assignment requires that the number of unused bits for a transfer
be equal to (number_of_bytes * 8) / bus_width where "/" is integer
division. For OC48, the bus can have 0, 4 or 8 unutilized bits. For
OC192 the bus can have 0, 8, 16, 24, 32, or 40 unutilized bits.

This means that no bit can shift between 12 bit
boundaries or else OC48 padding will not be optimal for certain
packet lengths.
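
The listed padding values can be reproduced with a short calculation.
The sketch below is an interpretation rather than the formula as
printed: it takes the unused bits to be whatever is needed to fill
the last bus transfer, and the function name is hypothetical.

```python
def unused_bits(number_of_bytes: int, bus_width: int) -> int:
    """Pad bits needed to fill the final transfer on a bus of
    bus_width bits (a remainder-based reading of the text)."""
    return (-number_of_bytes * 8) % bus_width


# Collect every padding value that occurs over a range of lengths.
oc48_padding = sorted({unused_bits(n, 12) for n in range(1, 200)})
oc192_padding = sorted({unused_bits(n, 48) for n in range(1, 200)})
```

Under this reading the OC48 bus is left with 0, 4 or 8 unused bits
and the OC192 bus with a multiple of 8 up to 40, which agrees with
the values stated above.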

For OC192c, maximum bandwidth utilization means that each
striper must receive the same number of bits (which implies bit
interleaving into the stripers). When combined with the same
backplane interconnection, this implies that in OC192c, each stripe
must have exactly the correct number of bits come from each striper
which has 1/4 of the bits.

For the purpose of assigning data bits to fabrics, a 48
bit frame is used. Inside the striper is a FIFO which is written 32
bits wide at 80-100 MHz and read 24 bits wide at 125 MHz. Three 32
bit words will yield four 24 bit words. Each pair of 24 bit words
is treated as a 48 bit frame. The assignments between bits and
fabrics depend on the number of fabrics.
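
The width conversion in the striper FIFO can be sketched as a bit
repacking step. The function names are hypothetical and the most
significant bit first ordering is an assumption; the text only gives
the word widths and the three-to-four ratio.

```python
def repack_32_to_24(words32):
    """Regroup 32 bit words into 24 bit words, most significant
    bits first (the bit order is an assumption)."""
    bits = 0
    nbits = 0
    out = []
    for w in words32:
        bits = (bits << 32) | (w & 0xFFFFFFFF)
        nbits += 32
        while nbits >= 24:
            nbits -= 24
            out.append((bits >> nbits) & 0xFFFFFF)
        bits &= (1 << nbits) - 1      # keep only the leftover bits
    return out


def frames_48(words24):
    """Join each pair of 24 bit words into one 48 bit frame."""
    return [(words24[i] << 24) | words24[i + 1]
            for i in range(0, len(words24) - 1, 2)]


words24 = repack_32_to_24([0x00112233, 0x44556677, 0x8899AABB])
frames = frames_48(words24)
```

Three 32 bit writes produce exactly four 24 bit reads and therefore
two 48 bit frames, with no bits left over.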



TABLE 11: Bit striping function
+24 to
12:23 


tO 

12:23 


+24 to 
12:23 






















4 fab    0:11     0,5,10    3,4,9     2,7,8     1,6,11
         12:23    15,16,21  14,19,20  13,18,23  12,17,22
         24:35    26,31,32  25,30,35  24,29,34  27,28,33
         36:47    37,42,47  36,41,46  39,40,45  38,43,44

6 fab    0:11     0,11   1,4    5,8    2,9    3,6    7,10
         12:23    14,21  15,18  19,22  12,23  13,16  17,20
         24:35    +24 to 0:11
         36:47    +24 to 12:23

12 fab   0:11     0   4   8   1   5   9   2   6   10  3   7   11
         12:23    15  19  23  12  16  20  13  17  21  14  18  22
         24:35    26  30  34  27  31  35  24  28  32  25  29  33
         36:47    37  41  45  38  42  46  39  43  47  36  40  44



The following tables give the byte lanes which are read
first in the aggregator and written to first in the separator. The
four channels are notated A,B,C,D. The different fabrics have
different read/write order of the channels to allow for all busses
to be fully utilized.



One fabric-40G

The next table gives the interface read order for the
aggregator.

Fabric   1st   2nd   3rd   4th
0        A     B     C     D
Par      A     B     C     D

Two fabric-80G

Fabric   1st   2nd   3rd   4th
0        A     C     B     D
1        B     D     A     C
Par      A     C     B     D



Three fabric-120G

Fabric   1st   2nd   3rd   4th
0        A     D     B     C
1        C     A     D     B
2        B     C     A     D
Par      A     D     B     C


Four fabric-160G

Fabric   1st   2nd   3rd   4th
0        A     B     C     D
1        D     A     B     C
2        C     D     A     B
3        B     C     D     A
Par      A     B     C     D



Six fabric-240G

Fabric   1st   2nd   3rd   4th
0        A     D     C     B
1        B     A     D     C
2        B     A     D     C
3        C     B     A     D
4        D     C     B     A
5        D     C     B     A
Par      A     C     D     B



Twelve fabric-480G

Fabric    1st   2nd   3rd   4th
0,1,2     A     D     C     B
3,4,5     B     A     D     C
6,7,8     C     B     A     D
9,10,11   D     C     B     A
Par       A     B     C     D


Interfaces to the gigabit transceivers will utilize the
transceiver bus as a split bus with two separate routeword and data
busses. The routeword bus will be a fixed size (2 bits for OC48
ingress, 4 bits for OC48 egress, 8 bits for OC192 ingress and 16
bits for OC192 egress); the data bus is a variable sized bus. The
transmit order will always have routeword bits at fixed locations.
Every striping configuration has one transceiver that is used to
talk to a destination in all valid configurations. That
transceiver will be used to send both routeword busses and to start
sending the data.

The backplane interface is physically implemented using
125 MHz interfaces to the backplane transceivers. The 125 MHz bus
for both ingress and egress is viewed as being composed of two
halves, each with routeword data. The two bus halves may have
information on separate packets if the first bus half ends a
packet.

For example, an OC48 interface going to the fabrics
locally speaking has 24 data bits and 2 routeword bits at 125 MHz.
This bus will be utilized acting as if it has 2 x (12 bit data bus
+ 1 bit routeword bus). The two bus halves are referred to as A
and B. Bus A is the first data, followed by bus B. A packet can
start on either bus A or B and end on either bus A or B.

In mapping data bits and routeword bits to transceiver
bits, the bus bits are interleaved. This ensures that all
transceivers should have the same valid/invalid status, even if the
striping amount changes. Routewords should be interpreted with bus
A appearing before bus B.

The bus A/Bus B concept closely corresponds to having tHtG- 
Mffe— interfaces between chips. 

All backplane busses support fragmentation of data. The
protocol used marks the last transfer (via the final segment bit in
the routeword). All transfers which are not final segment need to
utilize the entire bus width, even if that is not an even number of
bytes. Any given packet must be striped to the same number of
fabrics for all transfers of that packet. If the striping amount
is updated in the striper during transmission of a packet, it will
only update the striping at the beginning of the next packet.
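
The fragmentation rule can be sketched as follows. The function name
and the tuple layout are hypothetical; the sketch only models the
stated constraints that every non-final transfer fills the whole bus
width and that the last transfer carries the final segment bit.

```python
def segment(total_bits: int, bus_width: int):
    """Split a packet of total_bits (> 0) into transfers.
    Returns a list of (bits_used, final_segment_bit) tuples."""
    transfers = []
    remaining = total_bits
    while remaining > bus_width:
        transfers.append((bus_width, 0))   # full width, not final
        remaining -= bus_width
    transfers.append((remaining, 1))       # last transfer is marked final
    return transfers


# A 5 byte packet on a 12 bit bus: the full transfers each carry
# 1.5 bytes, which is not a whole number of bytes.
transfers = segment(5 * 8, 12)
```

Only the last tuple has the final segment bit set, so a receiver can
tell where the packet ends without knowing its length in advance.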

Each transmitter on the ASICs will have the following I/O
for each channel:

8 bit data bus, 1 bit clock, 1 bit control.

On the receive side, for each channel the ASIC receives

a receive clock, 8 bit data bus, 3 bit status bus.

The switch optimizes the transceivers by mapping a
transmitter to between 1 and 3 backplane pairs and each receiver
with between 1 and 3 backplane pairs. This allows only enough
transmitters to support traffic needed in a configuration to be
populated on the board while maintaining a complete set of
backplane nets. The motivation for this optimization was to reduce
the number of transceivers needed.

The optimization was done while still requiring that at
any time, two different striping amounts must be supported in the
gigabit transceivers. This allows traffic to be enqueued from a
striper striping data to one fabric and a striper striping data to
two fabrics at the same time.



In all modes — of operation. 


; — rtre — entire — 3 . OG — of — data 


— » 


always supported on switch ingress. — 


For egress operation, — for 




-anti — QOG, — the number — of — transceivers 


needed to — support — a — full 




speedup — wsrs — deemed — to — expensive . 


For — these — switch — modes , — 




output speedup is between 1.5 and 2. 


All configurations above 





support a full 2x speedup. 

Depending on the bus configuration, multiple channels may
need to be concatenated together to form one larger bandwidth pipe
(any time there is more than one transceiver in a logical
connection). Although quad gbit transceivers can tie 4 channels
together, this functionality is not used. Instead the receiving
ASIC is responsible for synchronizing between the channels from one
source. This is done in the same context as the generic
synchronization algorithm.

The 8b/10b encoding/decoding in the gigabit transceivers
allows a number of control events to be sent over the channel. The
notation for these control events is K characters and they are
numbered based on the encoded 10 bit value. Several of these K
characters are used in the chipset. The K characters used and
their functions are given in the table below.



TABLE 12: K Character usage

K character  Function         Notes

28.0         Sync indication  Transmitted after lockdown cycles, treated as the
                              prime synchronization event at the receivers
28.1         Lockdown         Transmitted during lockdown cycles on the backplane
28.2         Packet Abort     Transmitted to indicate the card is unable to finish
                              the current packet. Current use is limited to a port
                              card being pulled while transmitting traffic
28.3         Resync window    Transmitted by the striper at the start of a synch
                              window if a resynch will be contained in the current
                              sync window
28.4         BP set           Transmitted by the striper if the bus is currently
                              idle and the value of the bp bit must be set.
28.5         Idle             Indicates idle condition
28.6         BP clr           Transmitted by the striper if the bus is currently
                              idle and the bp bit must be cleared.
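
The K character assignments of Table 12 can be captured in a small lookup. This is a minimal sketch: the enumeration names and the helper `idle_character` are hypothetical, and the "K28.x" spelling is the usual 8b/10b notation rather than text taken from the table.

```python
from enum import Enum
from typing import Optional


class KChar(Enum):
    """K characters listed in Table 12, keyed by an illustrative name."""
    SYNC = "K28.0"           # sync indication, sent after lockdown cycles
    LOCKDOWN = "K28.1"       # sent during lockdown cycles on the backplane
    PACKET_ABORT = "K28.2"   # card cannot finish the current packet
    RESYNC_WINDOW = "K28.3"  # a resynch falls in the current sync window
    BP_SET = "K28.4"         # bus idle and the bp bit must be set
    IDLE = "K28.5"           # idle condition
    BP_CLR = "K28.6"         # bus idle and the bp bit must be cleared


def idle_character(bp_update: Optional[bool]) -> KChar:
    """Pick the character a striper would send on an idle bus.

    bp_update is None when the bp bit needs no change, True when it must
    be set, and False when it must be cleared.
    """
    if bp_update is None:
        return KChar.IDLE
    return KChar.BP_SET if bp_update else KChar.BP_CLR
```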



The switch has a variable number of data bits supported
to each backplane channel depending on the striping configuration
for a packet. Within a set of transceivers, data is filled in the
following order:

F[fabric]_[oc192 port number][oc48 port designation
(a,b,c,d)][transceiver_number]

Everything in the documentation is done for fabric-1,
which is the case where all connections are needed. The only part
of this which is used for fill order is transceiver_number (OC48)
and transceiver number and oc48 port designation for OC192.

The fundamental rules for mapping are the following:

1. BP & RW are on transceiver 1. These always occupy the first 4
bits of the transceiver.

2. Data bits starting with the least significant bit are filled
into the data bus in a 2 bit bit interleaved pattern, with bus A
and bus B pairs.

3. Transceivers are filled in starting at bit 0 of their transmit
and receive interfaces.

4. All multibit routeword fields are transmitted LSB to MSB. This
includes connection number, number of fabrics and encoded values of
stop/align/final segment. The overall routeword is notated
starting from bit 0 (least significant bit) and up. Transmit order
is Bit 0 (SOP) goes on the first routeword bit, followed by bit 1
(Packet type). If multiple routeword bits are transmitted in the
same clock they are filled in starting with the first bit going to
bit 0, the second bit going to bit 1.

5. Data should be encoded and decoded based on a bus A/Bus B
order.

-6-. Per* — OC102, — tfcre — fill — order — should be — btrs — £tt — &-, — &7 — B — fer 

2 0 routeword bits. For — data bits, — the fill order depends on wack - 

ing/unwacking/reverse unwacking and reverse wacking functions. 



Transceiver 1

For an ingress bus, the format of data is the following:

Bit 0   BP
Bit 1   0
Bit 2   RWA
Bit 3   RWB
Bit 4   Dataa(0)
Bit 5   Dataa(1)
Bit 6   Datab(0)
Bit 7   Datab(1)

Note that for 12 fabric mode, bits 5 and 7 are unused.

The location of datab(0) does not change.

For the egress bus, the format of the data is the
following:

Bit 0   RWA(0)
Bit 1   RWA(1)
Bit 2   RWB(0)
Bit 3   RWB(1)
Bit 4   Dataa(0)
Bit 5   Dataa(1)
Bit 6   Datab(0)
Bit 7   Datab(1)



Transceiver 2 and up

Fill up the data bus starting at each transceiver bit 0
to bit 7 with 2 bit interleaved dataa/datab patterns.

For example, transceiver 2 has the following pattern:

Bit 0   Dataa(2)
Bit 1   Dataa(3)
Bit 2   Datab(2)
Bit 3   Datab(3)
Bit 4   Dataa(4)
Bit 5   Dataa(5)
Bit 6   Datab(4)
Bit 7   Datab(5)
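
The 2 bit interleaved fill pattern for transceiver 2 and up can be sketched as follows. The function name is hypothetical. The pattern for transceiver 2 is taken from the listing above; continuing the same numbering for transceivers 3 and higher is an assumption of this sketch.

```python
from typing import List


def transceiver_layout(transceiver: int) -> List[str]:
    """Return the signal carried on bits 0..7 of a data-only transceiver.

    Transceiver 1 carries dataa/datab bits 0 and 1 next to the BP and
    routeword bits, so transceiver 2 starts at data bit 2. Each further
    transceiver is assumed to continue four data bits later.
    """
    if transceiver < 2:
        raise ValueError("transceiver 1 also carries BP and routeword bits")
    base = 2 + 4 * (transceiver - 2)
    layout: List[str] = []
    for pair in range(2):
        low = base + 2 * pair
        # Two bus A bits followed by the matching two bus B bits.
        layout += [f"dataa({low})", f"dataa({low + 1})",
                   f"datab({low})", f"datab({low + 1})"]
    return layout
```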

The stop/align encoding depends on the width of the bus interface. 



TABLE 13: OC48 portcard to fabric routeword stop/align

Field: Stop/Align

Length: 2 + n (where n is the number of clock cycles of transfer)

Function:

Stop bit is a 1 to indicate no stop, zero indicates stop. Stop bits
repeat in a serial stream until a stop bit of zero is seen, followed
by the align bit and FS. Since stop is followed by the align and FS
bits, the stop bit is given 2 clock cycles before the end of data.

Align bit is a one to indicate valid data on the last complete byte
on the interface. For odd 12 bit words (assuming zero based
counting), align = 0 indicates bits 0:3 are valid, and bits 4:11 are
invalid. Align = 1 for these words indicates that all 12 bits are
valid. For even words, align should normally be a 1.

Short packets are indicated by signaling a stop on byte 53 of the
transfer. In reality, 54 bytes will be transferred, but the packet
is flagged as a short packet.

Final segment is a one to indicate a final segment of a packet and a
zero to indicate a partial segment of a packet. Only one packet can
be in transit at any one time on this bus. This bit is only valid
for packets. For cells this bit should be a one. Packets which are
not final segments should be terminated only on odd cycles with all
bits utilized.
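
The serial stop/align stream of Table 13 can be decoded with a few lines. This is a minimal sketch under the assumption that the routeword stop/align bits arrive one per clock as a sequence of 0/1 values; the function name and the returned tuple layout are hypothetical.

```python
from typing import Iterable, Tuple


def parse_stop_align(bits: Iterable[int]) -> Tuple[int, int, int]:
    """Decode a serial stop/align stream.

    Stop bits of 1 (no stop) repeat until a stop bit of 0 is seen; the
    next two bits are the align bit and the final segment (FS) bit.
    Returns (number of no-stop bits seen, align, final_segment).
    """
    stream = iter(bits)
    no_stop_count = 0
    for bit in stream:
        if bit == 0:          # zero indicates stop
            break
        no_stop_count += 1    # one indicates no stop on this cycle
    else:
        raise ValueError("stream ended before a stop bit was seen")
    try:
        align = next(stream)
        final_segment = next(stream)
    except StopIteration:
        raise ValueError("stream ended before align and FS bits") from None
    return no_stop_count, align, final_segment
```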









TABLE 14: OC192 portcard to fabric routeword stop/align

Field: Stop/Align

Length: 3 + 4* number of extra clocks

Function:

Due to length restrictions on this bus, the stop/align has to be
treated differently than for OC48 transfers.

The first clock cycle, this field is 3 bits long and is notated as
SAF0. In all future clock cycles the stop field is 4 bits long and
notated SAF1. The definitions of SAF0 and SAF1 are given below.

SAF0(2:1): "00" indicates full word transfer.
"01" indicates a full word transfer but for a short packet.
"10" indicates a full word transfer but not the final segment.
"11" is reserved.

SAF1(0): Bit zero is a zero to indicate a stop, a one to indicate no
stop on the current cycle.

SAF1(3:1): binary value of the number of valid bytes. Zero is
reserved and 7 is used to indicate 0 bytes valid but not the final
segment. 6 indicates 6 bytes valid and final segment. All partial
word transfers automatically indicate an implied final segment.



TABLE 15: OC48 Fabric-port card routeword stop/align

Field: Stop/Align

Length: 3 + 2*

Function:

Value is treated as a repeated 2 bit value (encoded stop) followed
by the final segment bit.
Stop field is interpreted as:

00 - 1st byte finished is valid and stop
01 - 2nd bytes finished is valid and stop
10 - 3rd byte finished is valid and stop, or non-final segment.

Final segment is a one for a final segment, a zero for a continuing
packet. For final segments, the stop field should be encoded as a
"10"









The port card - fabric interface at OC192 variable routeword bits are given in the table below.



TABLE 16: OC192 Fabric-port card routeword stop/align



r?» -i -i 
1' ICIU 


Length 


Function 


Stop/Align 


7 l 8* number 


Bit 0 indicates stop. Zero indicates stop, 1 continue.


transfer 






Values 0xE, 0xF are reserved. Any non-12 byte ending offset automatically signals end of segm


cycle of data. 

Short packets are indicated by flagging a stop at byte 53.

Depending on the switch configuration, the bus may not
transfer an integer number of bytes. This is handled by the
interface always flagging the bytes which finish and the transmit
and receive state machines must track where bytes begin and end
based on the current cycle in the transfer.

The bus consists of a multiplexed address/data bus
(AD_DATA), a select signal (AD_SEL_L), a read/write signal (AD_RW),
and a bus transaction complete indication signal (AD_RDY_L). AD bus
is used for read/write access of control/status registers.

In order to write to a control/status register, the
read/write signal (AD_RW) must be low. The select signal (AD_SEL_L)
must be asserted low for the entire duration of the access, and
values must be placed on the AD_DATA bus in the following sequence
(cycle 0 is the first cycle where AD_SEL_L is low for this
transaction):

- cycle 2-5: Data to be written to control/status register. For
registers that are wider than 8 bits (maximum of 32 bits)
write data must be presented one byte per cycle starting with
LSB. Any data presented on the bus beyond the width of the
register will be ignored.

- subsequent cycles: ASIC will assert AD_RDY_L on completion of the
write access, and will keep it asserted until AD_SEL_L is
deasserted.

Figure 12 shows a Write Cycle.

In order to read from a control/status register, the
read/write signal (AD_RW) must be high. The select signal
(AD_SEL_L) must be asserted low for the entire duration of the
access, and values must be placed on the AD_DATA bus in the
following sequence (cycle 0 is the first cycle where AD_SEL_L is
low for this transaction):

- cycle 0: Address of control/status register

- cycle 2: AD_DATA bus should be released (hi-z)

- subsequent cycles: When the data is available, ASIC will drive the
read data onto the bus, one byte per cycle for four cycles,
along with assertion of AD_RDY_L signal. For registers smaller
than 32 bits wide, unused bits are presented as zeros. The LSB
is present on the bus during the 1st clock cycle of AD_RDY_L
assertion.

Figure 13 shows a Read Cycle.
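
The byte ordering used on the AD_DATA bus for register writes and reads can be sketched as follows. The function names are hypothetical. The sketch models only the data bytes, not the address phase, the hi-z turnaround or the AD_SEL_L and AD_RDY_L signalling.

```python
from typing import List


def write_data_bytes(value: int, width_bits: int) -> List[int]:
    """Bytes presented for a register write, one per cycle, LSB first."""
    if not 1 <= width_bits <= 32:
        raise ValueError("registers are at most 32 bits wide")
    value &= (1 << width_bits) - 1       # data beyond the register width is ignored
    n_bytes = (width_bits + 7) // 8
    return [(value >> (8 * i)) & 0xFF for i in range(n_bytes)]


def read_data_bytes(value: int, width_bits: int) -> List[int]:
    """Bytes driven for a register read: always four cycles, LSB first."""
    if not 1 <= width_bits <= 32:
        raise ValueError("registers are at most 32 bits wide")
    value &= (1 << width_bits) - 1       # unused bits are presented as zeros
    return [(value >> (8 * i)) & 0xFF for i in range(4)]


def assemble(bytes_lsb_first: List[int]) -> int:
    """Rebuild the register value from bytes received LSB first."""
    result = 0
    for i, byte in enumerate(bytes_lsb_first):
        result |= (byte & 0xFF) << (8 * i)
    return result
```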

The switch chips will generate interrupts on error
conditions. The interrupt lines have the following
characteristics:

1. Level Sensitive

2. Active Low

3. Asynchronous (no clock generated to go along with the
interrupt).

4. Assume point-to-point interconnection with board logic which
combines together interrupts.

Interrupts are maskable on a condition by condition basis
inside each chip. The interrupt signal is asserted on the
occurrence of an error condition and is cleared when the error
condition is cleared. Any temporary conditions which caused an
interrupt are recorded in the chip so no phantom interrupts should
be seen.
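
The interrupt behavior described above (level sensitive, active low, maskable per condition, with temporary conditions recorded) can be modelled in a few lines. This is a sketch only: the class, its method names and the use of string condition names are assumptions and do not describe the actual register interface of the chips.

```python
class InterruptModel:
    """Toy model of one active-low, level-sensitive interrupt line."""

    def __init__(self) -> None:
        self.active = set()    # error conditions currently present
        self.masked = set()    # conditions masked inside the chip
        self.recorded = set()  # every condition that has raised the line

    def raise_condition(self, name: str) -> None:
        self.active.add(name)
        self.recorded.add(name)   # temporary conditions stay recorded

    def clear_condition(self, name: str) -> None:
        self.active.discard(name)

    def mask(self, name: str) -> None:
        self.masked.add(name)

    def pin_level(self) -> int:
        """Return 0 (asserted, active low) while any unmasked condition is present."""
        return 0 if (self.active - self.masked) else 1
```

Because the line is level sensitive, it returns high as soon as the last unmasked condition clears, while the recorded set still shows which condition caused the interrupt.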



The reality of the switch is that errors will occur. The
intent in the following is to detail the expected system behavior
and recovery strategy needed for each error type.



TABLE 17: Error recovery in the ASICs

Columns: Error; Detection Mechanism; Error recovery required;
Hardware comments

Error: Stuck bit on port card egress
Detection Mechanism: unstriper sees data corruption from one fabric

Error: Stuck bit between agg & memory controller
Detection Mechanism: unstriper sees data corruption from one fabric,
either route word or data.

Error: Stuck bit between memory controller & separator
Detection Mechanism: unstriper sees data corruption from one fabric,
either route word or data

Error: Stuck bit on fabric egress

Error: Soft-fail on routeword from port card
Detection Mechanism: At least two unstripers see either a routeword
mismatch, a state with a high number of routeword mismatches, or data
parity errors or any number of unstripers will see a routeword
mismatch, a high number of routeword mismatches or data parity errors
and an aggregator will see a synch
Error recovery required: Queue resynch
Hardware comments: Worst case scenario involves failing routeword
with different fabric routewords to fabrics. Either queueing a packet
to the wrong port or dropping the traffic in the aggregator can cause
an impact to all ports. Probability of impacting more ports goes up
with traffic load and memory utilization in memory controllers.

Error: Soft-fail on data from port card
Detection Mechanism: Unstriper sees one time error
Error recovery required: None
Hardware comments: probability of automatic hardware based data
recovery is high

Error: Soft-fail between agg/memory controller dest id bus
Detection Mechanism: At least two unstripers see either a routeword
mismatch, a state with a high number of routeword mismatches, or data
parity errors
Error recovery required: Queue resynch

Error: Soft-fail between agg/memory controller data bus
Detection Mechanism: Unstriper sees one time error
Error recovery required: None
Hardware comments: probability of automatic hardware based data
recovery is high

Error: Soft-fail between memory controller/separator channel ID bus
Detection Mechanism: At least two unstripers see either a routeword
mismatch, a state with high number of routeword mismatches, or data
parity errors
Error recovery required: Queue resynch
Hardware comments: Tokens get out of synch. May see error of FIFO
overflow in the separator, depending on traffic pattern. Need
congestion on the fabric for a port to have the FIFO overflow become
possible. May also see excess tokens in memory controller.

Error: Soft-fail between memory controller/separator data bus for RW
data
Detection Mechanism: Packet boundaries from one separator port are
lost. Unstriper will show a large number of errors for all traffic
from the affected aggregator output.
Error recovery required: Queue Resynch
Hardware comments: Inherent that no self-stabilizing occurs w/o queue
resynch.

Error: Soft-fail between memory controller/separator data bus for
packet data
Detection Mechanism: Single port sees one-time error.

Error: Soft-fail on token bus from separator to memory controller
Detection Mechanism: Mismatches from fabric due to differences in
separator scheduling.
Error recovery required: Queue Resynch

Error: Soft-fail internal to fabric chips
Detection Mechanism: Unstriper sees different traffic from fabric
than other fabrics
Error recovery required: Reset
Hardware comments: Queue Resynch may fix the problem, reset is
necessary for restoring state.

Error: Aggregator never sees backplane idle to synchronize to bus
sync
Detection Mechanism: Aggregator never sets flag indicating it has
seen backplane sync
Error recovery required: Replace faulty hardware.
Hardware comments: same as below

Error: Aggregator never sees system synch
Detection Mechanism: Aggregator never sets flag indicating it has
seen backplane sync
Error recovery required: Replace faulty hardware
Hardware comments: Locating fault requires seeing if only this board
is having problems (backplane sync receiver) or if multiple boards
are reporting problems (lost both sync signals on the backplane).
Error isolation in 40G switch requires looking at the state of the
secondary synch pulse re-generator.

Error: Memory controller does not see synch from agg
Error recovery required: Retry resynch or if permanent replace faulty
hardware.

Error: Separator does not see synch from mem cont
Detection Mechanism: Separator never gets initial synch
Error recovery required: replace faulty hardware

Error: Unstriper does not see backplane idle
Detection Mechanism: Unstriper never gets backplane synch
Error recovery required: replace faulty hardware

Error: Fabric chips not initialized
Detection Mechanism: Chips do not do anything
Error recovery required: Initialize the hardware
Hardware comments: Fault can be caused by failure of the on-board
processor. If soft fail, watchdog should catch it.

Error: Striper not initialized
Detection Mechanism: Transmit no data on the backplane
Error recovery required: Initialize striper

Error: Unstriper not initialized
Detection Mechanism: All incoming data ignored
Error recovery required: Initialize unstriper

Error: Stripe amount incorrect
Detection Mechanism: Offending data is dropped in striper, interrupt
asserted
Error recovery required: Correct stripe amount
Hardware comments: Detection comes up as a result of a disagreement
between the stripe amount and the configuration register for the
switch operating mode.

Error: Primary sync pulse TX failure
Detection Mechanism: Synch pulse receiver on all boards will see
error on primary and switch to secondary.
Error recovery required: Replace board with primary TX

Error: Secondary sync pulse TX failure
Detection Mechanism: Synch pulse receiver on all boards will see
error on secondary.

Error: Sync pulse receiver failure on one board
Detection Mechanism: If leaving reset, no chips on board get in sync.
If during operation, should see a synch error either in an aggregator
or an unstriper fed by this block.
Error recovery required: Replace board with bad synch pulse receiver
Hardware comments: Need to see how wide error is spread to attempt to
identify the source.

Error: Board loses single sync pulse internal to the board
Error recovery required: If any FIFOs overflow aggregator or
unstriper, queue resynch

Error: Hard failure on sync pulse distribution to a single chip on a
fabric
Detection Mechanism: May see FIFO overflow/underflow in fabric chip
or see synch failure from the downstream chip. Additionally, if data
is corrupted, the unstriper will report data corruption from the
associated fabric
Error recovery required: Replace

Error: Hard failure on sync pulse distribution to a single chip on a
port card
Detection Mechanism: unstriper - May see what looks like a single
fabric mismatch due to one fabric going out of synch before the
others.
Error recovery required: Reset port card
Hardware comments: Same as below.

Error: Soft failure on sync pulse distribution to a single chip on a
port card
Detection Mechanism: None
Error recovery required: If no FIFO overflow, none. If FIFO overflow,
need to reset board(s) with FIFO overflow.
Hardware comments: Striper missing synch pulse could overflow a FIFO
on every fabric. Recovery would need to be done serially and switch
could be effectively down by this error. Only way to ensure all
fabrics do the same thing is to ensure that data path has the same
delay as the synch path since the writes occur at different logical
times. An unstriper missing would affect the output port mapped to
the striper and would require a

Error: Soft failure on sync pulse on a fabric
Detection Mechanism: Same as single failure case
Error recovery required: Reset the fabric
Hardware comments: Same as single-failure.


The chipset implements certain functions which are
described here. Most of the functions mentioned here have support
in multiple ASICs, so documenting them on an ASIC by ASIC basis
does not give a clear understanding of the full scope of the
functions required.

The switch chipset is architected to work with packets up
to 64K + 6 bytes long. On the ingress side of the switch, there
are buses which are shared between multiple ports. For most
packets, they are transmitted without any break from the start of
packet to end of packet. However, this approach can lead to large
delay variations for delay sensitive traffic. To allow delay
sensitive traffic and long traffic to coexist on the same switch
fabric, the concept of long packets is introduced. Basically long
packets allow chunks of data to be sent to the queueing location,
built up at the queueing location on a source basis and then added
into the queue all at once when the end of the long packet is
transferred. The definition of a long packet is based on the
number of bits on each fabric. The following table gives the size
of long packets for different switch sizes.

TABLE 18: Long packet sizes

Switch Size    Packet Size



If the switch is running in an environment where Ethernet
MTU is maintained throughout the network, long packets will not be
seen in a switch greater than 40G in size.

A wide cache-line shared memory technique is used to
store cells/packets in the port/priority queues. The shared memory
is 8K entries x 200 bit wide running at 125MHz. Each memory
controller ASIC yields 25Gbps memory bandwidth. The aggregator #9
(control port) generates at most 4 streams of OC48 traffic. The
enqueue and dequeue speed for different switch configurations is
shown in the following table. Note that a 2x speedup can be
achieved for all switch configurations except the 400G switch. Up to
234,057 cells can be stored in the 400G switch. The shared memory
stores cells/packets continuously so that there is virtually no
fragmentation and bandwidth waste in the shared memory.
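
The shared memory figures quoted above can be cross-checked with simple arithmetic. This sketch assumes the memory holds 8K (8192) entries of 200 bits clocked at 125MHz, and that a cell occupies 7 bits per memory controller in the largest switch; the constant names are illustrative only.

```python
ENTRIES = 8 * 1024          # assumed number of shared memory entries (8K)
ENTRY_BITS = 200            # width of one cache line in bits
CLOCK_HZ = 125_000_000      # shared memory clock
CELL_BITS = 7               # assumed per-controller cell length, largest switch

total_bits = ENTRIES * ENTRY_BITS        # capacity of one memory controller
bandwidth_bps = ENTRY_BITS * CLOCK_HZ    # one 200 bit entry per clock
max_cells = total_bits // CELL_BITS      # whole cells that fit in the memory

print(total_bits)      # 1638400
print(bandwidth_bps)   # 25000000000
print(max_cells)       # 234057
```

Under these assumptions the results agree with the 25Gbps memory bandwidth and the 234,057 cell capacity stated in the text.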

For the short packets/cells, memory utilization can be
close to 100%. For the long packets, the memory block before the
start of a long packet can be almost completely wasted. The
minimum length for a long packet is 3 cache lines, giving an
effective utilization of memory close to 75% since 1 out of 4
memory cache lines can be wasted.

TABLE 19: Shared Memory (1,638,400 bits) in Each Memory Controller

Switches  Enqueue   Dequeue    Speedup  Cell Length  Number of cells

          4.3Gbps   20.7Gbps            39+1 bits
          4.7Gbps   20.3Gbps            21+1 bits
          5.0Gbps   20Gbps              15+1 bits    102,400
          5.3Gbps   19.7Gbps            12+1 bits    126,030
          7Gbps     18Gbps              9+1 bits     163,840
          9.4Gbps   15.6Gbps            6+1 bits     234,057



There exists ttp — to — zMW> multiple queues in the shared 
memory. They are per-destination and priority based. All
cells/packets which have the same output priority and blade/channel
ID are stored in the same queue. Cells are always dequeued from
the head of the list and enqueued into the tail of the queue. Each
cell/packet consists of a portion of the egress route word, a
packet length, and variable-length packet data. Cells and packets
are stored continuously, i.e., the memory controller itself does
not recognize the boundaries of cells/packets for the unicast
connections. The packet length is stored for MC packets. There is
a limitation of 4K packets (or cells) in each of the MC queues.



The multicast port mask memory 64Kx16-bit is used to
store the destination port mask for the multicast connections, one
entry (or multiple entries) per multicast VC. The port masks of the
head multicast connections indicated by the multicast DestID FIFOs
are stored internally for the scheduling reference. The port mask
memory is retrieved when the port mask of head connection is
cleaned and a new head connection is provided.



Two configurations of port mask memory are supported:

1. 8K port connections, for a 240G switch

2. 4K connections, for a 400G switch.

Dequeue performance is restricted by several factors: 1)
Padding injected by the aggregator ASICs; 2) Left alignment entries
inserted in the memory controllers; 3) Memory controller output bus
fragmentation caused by the multicast connections; 4) Token bus
latency between the separators and the memory controllers; 5)
Separator output bus padding; and 6) Unstriper output bus
fragmentation. A 400G switch is used as an example to analyze the
worst-case performance since it has most padding, overhead, and
congested traffic.

The aggregator ASICs have to pad a packet (including 36-
bit route word, variable length packet length field and datagram)
to multiples of 12 since there are 12 memory controllers in one
fabric. The shortest packet each memory controller received is 7-
bit long since a packet can be as short as 84 bit long. The
effective datagram is 3 bits. One entry will be left aligned for
every 16 200-bit memory entries. The left aligned entry can be as
short as 1 bit long. The worst-case datagram dequeue efficiency per
output port of a memory controller is:

(10 bit (dout_me bus width) * (3/7) (datagram length in a shortest
packet) * (15/16) (left-aligned overhead)) * 250MHz (output bus
speed) * 12 (number of memory controllers) / 24 (number of output
ports per separator) = 502Mbps

The best-case output data bus bandwidth per separator
channel is 2 bit * 250MHz, i.e., 500Mbps. In other words, the
worst-case dequeue bandwidth of a memory controller is bigger than
the best-case output bandwidth of a separator port. 2x speedup can
be achieved through the twice wide output bus of the separators.
One sync cycle will be fired on the output bus of the separator
every 120 cycles.

The output bus of the unstriper ASIC is 64 bit wide at
100MHz. It can only carry one packet per cycle. In the worst case,
up to 56 bits are wasted per packet for an OC48 port.
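
The worst-case memory controller dequeue figure and its comparison with the separator channel bandwidth can be reproduced numerically. This is a sketch, not the applicants' derivation: the function and parameter names are hypothetical, and the values 16 (entries per left-aligned entry) and 24 (output ports per separator) are assumptions read from partly illegible text, adopted here because they reproduce the stated 502Mbps.

```python
def worst_case_dequeue_mbps(bus_width_bits: int = 10,
                            datagram_bits: int = 3,
                            shortest_packet_bits: int = 7,
                            entries_per_alignment: int = 16,
                            bus_mhz: int = 250,
                            memory_controllers: int = 12,
                            ports_per_separator: int = 24) -> float:
    """Worst-case datagram dequeue rate per output port, in Mbps."""
    # Fraction of each shortest packet that is real datagram.
    datagram_fraction = datagram_bits / shortest_packet_bits
    # One entry in every group is lost to left alignment.
    alignment_fraction = (entries_per_alignment - 1) / entries_per_alignment
    useful_bits_per_cycle = bus_width_bits * datagram_fraction * alignment_fraction
    return useful_bits_per_cycle * bus_mhz * memory_controllers / ports_per_separator


def separator_channel_mbps(bus_width_bits: int = 2, bus_mhz: int = 250) -> float:
    """Best-case output data bus bandwidth per separator channel, in Mbps."""
    return float(bus_width_bits * bus_mhz)
```

With the default arguments the first function evaluates to roughly 502 and the second to 500, consistent with the statement that the worst-case dequeue bandwidth exceeds the best-case separator port bandwidth.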

APS stands for Automatic Protection Switching, which is
a SONET redundancy standard. To support the APS feature in the switch,
two output ports on two different port cards send roughly the same
traffic. The memory controllers maintain one set of queues for an
APS port and send duplicate data to both output ports.

To support data duplication in the memory controller
ASIC, each one of [[192]] multiple unicast queues has a
programmable APS bit. If the APS bit is set to one, a packet is
dequeued to both output ports. If the APS bit is set to zero for
a port, the unicast queue operates at the normal mode. If a port
is configured as an APS slave, then it will read from the queues of
the APS master port. For OC48 ports, the APS port is always on the
same OC48 port on the adjacent port card.
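
The effect of the programmable APS bit on dequeue can be sketched as a simple destination lookup. The function name, the integer port numbering and the `aps_partner` mapping are hypothetical and serve only to illustrate the duplication rule.

```python
from typing import Dict, List


def dequeue_destinations(port: int, aps_bit: bool,
                         aps_partner: Dict[int, int]) -> List[int]:
    """Return the output ports a packet from this unicast queue is sent to.

    With the APS bit set, the packet is dequeued to both the port and
    its APS partner; with the bit clear the queue operates normally.
    """
    if aps_bit:
        return [port, aps_partner[port]]
    return [port]
```

For example, with `aps_partner = {3: 19}`, a packet from the queue of port 3 goes to ports 3 and 19 when the APS bit is set, and only to port 3 otherwise.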



Port mirroring is similar to the APS except that any port
can pair with any port. Only one pair of port mirroring ports are
supported. A 16 bit port mirror register is used to identify the
master and slave port involved in the port mirror operation. Frirt
ports are compared to the master portion (bit 15:0) of the register
when dequeuing. Port mirror can be disabled. Note that a port can
either have APS enabled or port mirroring enabled, not both. The
value of the port mirror register can be changed on fly by the
shadow registers.

The shared memory queues in the memory controllers among
the fabrics might be out of sync (i.e., same queues among different
memory controller ASICs have different depths) due to clock drifts
or a newly inserted fabric. It is important to bring the fabric
queues to the valid and sync states from any arbitrary states. It
is also desirable not to drop cells for any recovery mechanism.

A resynch cell is broadcast to all fabrics (new and
existing) to enter the resynch state. Fabrics will attempt to drain
all of the traffic received before the resynch cell before queue
resynch ends, but no traffic received after the resynch cell is
drained until queue resynch ends. A queue resynch ends when one of
two events happens:

1. A timer expires.
2. The amount of new traffic (traffic received after the resynch
cell) exceeds a threshold.

At the end of queue resynch, all memory controllers will
flush any left-over old traffic (traffic received before the queue
resynch cell). The freeing operation is fast enough to guarantee
that all memory controllers can fill all of memory no matter when
the resynch state was entered.

Queue resynch impacts all 3 fabric ASICs. The
aggregators must ensure that the FIFOs drain identically after a
queue resynch cell. The memory controllers implement the queueing
and dropping. The separators need to handle memory controllers
dropping traffic and resetting the length parsing state machines
when this happens. For details on support of queue resynch in
individual ASICs, refer to the chip ADSs.
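
The two conditions that end a queue resynch, and the flush of left-over old traffic, can be sketched as a small state holder. The class, its fields and the use of cycle and bit counts as units are assumptions for illustration; the specification does not state how the timer or the threshold are expressed.

```python
class QueueResynch:
    """Toy model of one memory controller during a queue resynch."""

    def __init__(self, timeout_cycles: int, new_traffic_threshold: int) -> None:
        self.timeout_cycles = timeout_cycles
        self.new_traffic_threshold = new_traffic_threshold
        self.elapsed = 0       # cycles since the resynch cell arrived
        self.old_traffic = 0   # traffic received before the resynch cell
        self.new_traffic = 0   # traffic received after the resynch cell

    def tick(self) -> None:
        self.elapsed += 1

    def receive_new(self, amount: int) -> None:
        # New traffic is held, not drained, until queue resynch ends.
        self.new_traffic += amount

    def finished(self) -> bool:
        timer_expired = self.elapsed >= self.timeout_cycles
        threshold_exceeded = self.new_traffic > self.new_traffic_threshold
        return timer_expired or threshold_exceeded

    def end(self) -> int:
        """Flush left-over old traffic and return the amount flushed."""
        flushed, self.old_traffic = self.old_traffic, 0
        return flushed
```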

Multicast connections are enqueued into one of 4 priority
queues based on the 2 bit priority number. They are stored cache-
line based like the way unicast connections do. Connection numbers
and lengths are stored into one of 4 1K entry per priority
connection FIFO. Multicast packets are subject to be dropped if the
destined connection FIFO is full. In other words, at most 1K
multicast packets can be stored simultaneously for each priority.

The 64Kx16-bit port mask memory will limit the number of
multicast connections supported to 64K, 32K, 16K, 16K, 8K, and 4K
for the 40G, 80G, 120G, 160G, 240G, and 400G switch, respectively.

For the dequeue side, multicast connections have
independent 32 tokens per port, each worth up to 50-bit data or a
complete packet. The head connection and its port mask of a higher
priority queue are read out from the connection FIFO and the port
mask memory every cycle (125MHz). A complete packet (or 50 bits if
the packet is longer than 50 bits) is isolated from the 200 bit
multicast cache line based on the length field of the head
connection. The head packet is sent to all its destination ports.
The 8 queue drainers transmit the packet to the separators when
non-zero multicast tokens are available for the ports.
The next head connection will be processed only when the current head
packet is sent out to all its ports.

For the worst case analysis, take the 480G switch as an
example where the shortest packet is 7 bit long. Every 8ns cycle
only one connection can be handled (bottlenecked by the connection
FIFO and port mask memory). If the multicast only goes to 1 port,
the effective dequeue throughput for the multicast connection is
875Mbps out of available lDGbps shared memory dequeue bandwidth,
i.e., 6%. In other words, the multicast performance is severely
damaged by the bottlenecks existing in the connection FIFO, port
mask memory, and head-of-line blocking. The throughput for the 480G
switch is 400 + 7 + n/00-n + 42G where n is the number of copies a
multicast connection is destined to. In the worst case where n = 1, the
multicast throughput is about 9% of available switch capacity. If the
average multicast connection makes 11 copies, the switch can achieve
480G throughput.

The longer a packet is (for the 240G switch or smaller
configurations) and the more ports a multicast connection is destined
to, the better the dequeue performance becomes. Multicast
performance does not interfere with the dequeue speedup for unicast
connections since the latter have their own tokens and the two types of
connections share the dout_me bus alternately in a strict round-
robin fashion, i.e., the multicast connections do not block unicast
ones.



There are 192 unicast queues, 4 multicast queues, and 4
control port queues. The 4 multicast queues are per priority based and
can broadcast to any subset of 192 output ports and the 4 control
ports.

There are up to 196 destination channels (192 blade
channels and 4 control ports) for the 480G switch. Each destination
has a one to one mapped unicast queue. 4 multicast queues can
broadcast to any subsets of 192 regular ports indicated by the per
connection based port mask entry. An OC-192 port uses one out of
4 queue locations. Other three queues are unused. All 8 bit fabric
queue ID field on the DestID bus is used to identify one of 196
ports. 2 bit priority field is unused.

For the 240G switch, up to 100 destination channels exist
(96 blade channels and 4 control ports). 96 unicast destination
queues have 2 priority queues each. 4 multicast queues can
broadcast to any subsets of 96 ports indicated by the per
connection based port mask entry. An OC-192 port uses one out of 4
queue locations. Other three queues are unused. Lower 7 bit queue
ID is used to identify one of 100 ports and lower 1-bit of priority
field is used to identify one of two priority queues in each port.
Other queue ID bit and priority bit is unused.

For the 160G switch, up to 68 destination channels exist
(64 blade channels and 4 control ports). 64 unicast destination
queues have 2 priority queues each. There are 66 unused queues. 4
multicast queues can broadcast to any subsets of 68 ports indicated
by the per connection based port mask entry. An OC-192 port uses
one out of 4 queue locations. Other three queues are unused.
Lower 7 bit queue ID is used to identify one of 100 ports and lower
1-bit of priority field is used to identify one of two priority
queues in each port. Other queue ID bit and priority bit is unused.

For the 120G or smaller switch, up to 52 destination
channels exist (48 blade channels and 4 control ports). 48 unicast
destination queues have 4 priority queues each. 4 multicast queues
can broadcast to any subsets of 48 ports indicated by the per
connection based port mask entry. An OC-192 port uses one out of
4 queue locations. Other three queues are unused. Lower 6 bit
queue ID is used to identify one of 52 ports and 2-bit priority
field is used to identify one of 4 priority queues in each port.
Other queue ID bits are unused.

Queue structure can be changed on the fly through the fabric
resync cell, where the number of priority per port field is used to
indicate how many priority queues each port has.

The striper ASIC resides on the network blade. It has the
following features:

* Support packet/cell interfaces. Can accept up to 3 GB/sec of
  sustained traffic (3.2 GB/sec in bursts) of cells, frames, or
  a mix of cell and frame traffic.
* Generates fabric routeword for all fabrics in the switch
* Calculates data for the parity fabric and adds checksum to the
  end of each packet.
* Support switch configuration: 40G, 80G, 120G, 160G, 240G, and 480G
* Generates appropriate signals to interface directly to the
  transmit side of the Gbit transceivers.

The Striper takes DID cell/packet format from the ingress
port ASIC. For the ATM interface, the ASX cell format is accepted
from the Vortex ASIC of the Poseidon chipset at 2.5Gbps for the
channelized blade. It consists of 4 byte route word, 4 byte ATM
cell header (without HEC byte), and 48 byte payload. The 36-bit
switch route word can be generated based on the ASX route word
provided by the Vortex ASIC.

The Striper ASIC consists of three major blocks: the
switch route word generator, the switch payload & checksum
generator, and the switch parity generator.

The switch payload generator forwards 4 byte ATM cell
head, 48-byte ATM cell payload and 2 byte checksum to up to 12
switch fabrics and 1 spare fabric. The cell bus is 2x 12-bit wide
running at 125MHz.

The Striper ASIC duplicates the packet/cell and transmits
various fragments to the fabrics. 12 data output buses of the
striper ASICs are connected to the data input buses of the
aggregator ASICs on the fabrics as follows:

Figure 14 shows striper ASIC architecture.

TABLE 20: Data bus connectivity of the Striper ASIC of blade #1

Each entry gives the bits of bus DIN_AG_f_1_ch_1 on fabric #f and
the DOUT_ST_1 cell (or parity) bits connected to them.

Fabric  40G           80G                120G               160G               240G                480G
        (1 fabric)    (2 fabrics)        (3 fabrics)        (4 fabrics)        (6 fabrics)         (12 fabrics)
#1      cell[11:0]    [5:0] cell[11:6]   [3:0] cell[11:8]   [2:0] cell[11:9]   [1:0] cell[11:10]   [0] cell[11]
#2                    [5:0] cell[5:0]    [3:0] cell[7:4]    [2:0] cell[8:6]    [1:0] cell[9:8]     [0] cell[10]
#3                                       [3:0] cell[3:0]    [2:0] cell[5:3]    [1:0] cell[7:6]     [0] cell[9]
#4                                                          [2:0] cell[2:0]    [1:0] cell[5:4]     [0] cell[8]
#5                                                                             [1:0] cell[3:2]     [0] cell[7]
#6                                                                             [1:0] cell[1:0]     [0] cell[6]
#7                                                                                                 [0] cell[5]
#8                                                                                                 [0] cell[4]
#9                                                                                                 [0] cell[3]
#10                                                                                                [0] cell[2]
#11                                                                                                [0] cell[1]
#12                                                                                                [0] cell[0]
spare   parity[11:0]  [5:0] parity[5:0]  [3:0] parity[3:0]  [2:0] parity[2:0]  [1:0] parity[1:0]   [0] parity[0]


The striper ASICs on blade #1 are connected with
aggregator ASIC #1 of all switch fabrics. The striper ASICs on
blade #2 are connected with aggregator ASIC #2 of all switch
fabrics. The striper ASICs on blade #4 are connected with aggregator
ASIC #4 of all switch fabrics. The striper ASICs on blade #5 to #8
are connected with aggregator ASIC #5 to #8 of all switch fabrics,
respectively. The striper ASICs on blade #41 to #48 are connected
with aggregator ASIC #1 to #8 of all switch fabrics, respectively.
In other words, the blade number modulo 8 is the aggregator ASIC
number to which a striper ASIC is connected.
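
The blade-to-aggregator rule can be written as a one-line mapping. The sketch below assumes 8 regular aggregator ASICs per fabric and 1-based numbering for both blades and aggregators, so that a remainder of 0 corresponds to aggregator #8; the function name is illustrative.

```python
NUM_AGGREGATORS = 8  # assumed number of regular aggregator ASICs per fabric


def aggregator_for_blade(blade: int) -> int:
    """Map a 1-based blade number to a 1-based aggregator ASIC number."""
    if blade < 1:
        raise ValueError("blade numbers start at 1")
    # Shift to 0-based, take the remainder, shift back to 1-based.
    return (blade - 1) % NUM_AGGREGATORS + 1
```

Under these assumptions blades #1, 9, 17, ... share aggregator #1 and blades #5, 13, 21, ... share aggregator #5.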



The parity bits are sent to the spare fabric. The purpose
of the spare fabric is to provide fault tolerance ability to the
switch, i.e., in case one of the switch fabrics failed, the spare
fabric recovers the lost part of the cell. This is achieved through
a parity bit generator on the striper ASIC. For one fabric
configuration, the 12 bit cell payload is duplicated to the spare
fabric; for 2 fabric configuration, 6 bit parity bits are generated
as follows:

parity bit(1:6) = cell bit(1:6) exclusive OR cell bit(7:12);

For 3-fabric configuration, 4 bit parity bits are
generated as follows:

parity bit(1:4) = cell bit(1:4) exclusive OR cell bit(5:8)
exclusive OR cell bit(9:12);
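
The parity scheme above is a bitwise XOR across equal-width cell slices, so any single lost slice can be rebuilt from the surviving slices and the parity. The following sketch illustrates this for a 12-bit cell word; the function names and the most-significant-slice-first ordering are assumptions made for the example.

```python
from functools import reduce
from operator import xor


def slices(cell: int, n_fabrics: int, width: int = 12) -> list[int]:
    """Split a width-bit cell word into n equal slices, most significant first."""
    if width % n_fabrics:
        raise ValueError("width must divide evenly across fabrics")
    w = width // n_fabrics
    mask = (1 << w) - 1
    return [(cell >> (w * i)) & mask for i in reversed(range(n_fabrics))]


def parity(parts: list[int]) -> int:
    """Parity slice sent to the spare fabric: XOR of all data slices."""
    return reduce(xor, parts, 0)


def recover_missing(parts: list, parity_slice: int) -> int:
    """Rebuild the single slice marked None from the others plus parity."""
    if sum(p is None for p in parts) != 1:
        raise ValueError("exactly one slice must be missing")
    return reduce(xor, (p for p in parts if p is not None), parity_slice)
```

For example, with three fabrics the slices of 0b101100111010 are 0b1011, 0b0011 and 0b1010, and losing the middle slice still allows it to be recovered from the other two and the parity slice.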

The route word generator regenerates the switch route
word and sends up to 12+1 1-bit 250MHz route word buses for fabric
1, 2, 3, ..., 12 and the spare fabric.

The aggregator ASIC resides on the switch fabric as shown
in the following figure. Each 40G switch fabric has 8+1 aggregator
ASICs. It aggregates 6x4 separate cell streams and route words into
a single 12G stream from up to 6 blades and 4 channels. All input
signals from the network blades are 250MHz point-to-point HSTL. It
outputs a single cell stream that is multiplexed with cell payload
and route words to 12 memory controllers. The ASIC has the following
features:

* 12Gbps Data and route word input from up to 6 network blades
  and 4 channels
* Route word separation and aggregation
* Output 12G data and route word to 12 memory controller ASICs
* HSTL interface with the memory controller, receiver interface
  for the backplane gigabit transceivers.

Figure 15 shows aggregator ASIC architecture.

The aggregator ASIC supports 40G, 80G, 120G, 160G, 240G,
and 480G switch configuration without backplane change. The
backplane connectivity (DIN_AG buses) of a pair of aggregator ASICs
is shown as follows:



TABLE 21: DIN_AG bus connectivity of aggregator ASIC #1 and #5 of switch fabric #1

[Table rows DIN_AG_1_1_ch_1 through DIN_AG_1_5_ch_6 map, for each
switch configuration, the DIN_AG input bits to the DOUT_ST_b_ch_1
output bits and cell bits of blades b = 1, 5, 9, 13, 17, 21, 25, 29,
33, 37, 41, and 45.]

The 2 x 6 DIN_AG buses of aggregator ASIC #1 and #5 pair
of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively. The 2 x 6 DIN_AG buses of aggregator ASIC #2 and #6
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, and 46,
respectively. The 2 x 6 DIN_AG buses of aggregator ASIC #3 and #7
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, and 47,
respectively. The 2 x 6 DIN_AG buses of aggregator ASIC #4 and #8
pair of switch fabric #1 are connected to the 12 x DOUT_ST bus #1 of
blade #4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, and 48,
respectively.

Likewise, the 2 x 6 DIN_AG buses of aggregator ASIC #1
and #5 pair of switch fabric #2 are connected to the 12 x DOUT_ST
bus #2 of blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively. The 2 x 6 DIN_AG buses of aggregator ASIC #1 and #5
pair of switch fabric #12 are connected to the 12 x DOUT_ST bus #12
of blade #1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, and 45,
respectively, for the 480G switch configuration.

The above connectivity is repeated 4 times for the
channelized blades.

For the 40G, 80G, 120G, 160G, 240G, and 480G
configuration, each blade channel sends 12 x 36-bit cell payload
and 36-bit route word, 6 x 36-bit payload and 36-bit route word,
4 x 36-bit payload and 36-bit route word, 3 x 36-bit payload and
36-bit route word, 2 x 36-bit payload and 36-bit route word, and 1
x 36-bit payload and 36-bit route word to each switch fabric,
respectively. In other words, the whole 12-bit wide cell is
transmitted in the same fabric for the 40G switch while only a 1-
bit wide (1/12 cell) cell slice is transmitted on each fabric for
the 480G switch.

The 60-bit DOUT_AG bus is split onto 12 memory controller
ASICs, each receiving 5-bit data and 1 bit clock signal from one
aggregator ASIC. The 15 bit DestID bus is broadcast to all 12
memory controllers. Due to the fan out load concern, 3 copies of
the signals are maintained, each driving 4 ASIC loads.

Every channel of the aggregator sends up to 12x3x200 bit
cell/packet stream to 12 memory controllers based on a work
conserving round-robin dequeue algorithm, i.e., the next source takes
over if the current source runs out of eligible cells/packets to
send. Strict round-robin algorithm is used among 24 sources. For
the 40G switch, only 4 source channels exist. A source is eligible
to send a cell/packet whenever a full cell or a full short packet
or a 12x3x200 bit segment of a long packet is received.
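
A work-conserving round-robin dequeue of this kind can be sketched as below. The model is deliberately simplified: each source is a software queue of already-eligible items, and the function name and data types are assumptions for illustration rather than a description of the ASIC logic.

```python
from collections import deque


def round_robin_dequeue(sources: list, start: int = 0):
    """Yield (source_index, item) in round-robin order, skipping empty sources.

    Work conserving: an empty source does not consume a service slot, so the
    next source with eligible traffic takes over immediately.
    """
    n = len(sources)
    i = start
    while any(sources):          # a deque is truthy only when non-empty
        if sources[i]:
            yield i, sources[i].popleft()
        i = (i + 1) % n
```

With sources `[a1, a2]`, `[]` and `[c1]`, the service order is a1, c1, a2: the empty middle source is skipped without wasting a turn.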

Each memory controller ASIC receives 9 independent cell
streams from 9 aggregator ASICs. There are 9 250MHz DIN_ME_fb_se
buses, each consisting of a 5 bit data bus, a 1 bit clock signal,
and a 15 bit DestID bus. The 60-bit DOUT_AG data buses of all 9
aggregator ASICs are bit sliced onto 12 memory controllers, each
receiving 5 bit data from one DOUT_AG bus. Every memory controller
gets a separate non sharing clock signal (named clk1 to clk12) from
each DOUT_AG bus to reduce the load of the clock pin while 3 memory
controllers share a set of DestID bus from the DOUT_AG bus. The 9
DIN_ME_fb_se buses of memory controller #1 are connected to the
DOUT_AG buses of 9 aggregators as follows:

DIN_ME_fb_1_1_data = DOUT_AG_fb_1_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_1_dest = DOUT_AG_fb_1_dest1
DIN_ME_fb_1_1_clk  = DOUT_AG_fb_1_clk1
DIN_ME_fb_1_2_data = DOUT_AG_fb_2_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_2_dest = DOUT_AG_fb_2_dest1
DIN_ME_fb_1_2_clk  = DOUT_AG_fb_2_clk1
DIN_ME_fb_1_3_data = DOUT_AG_fb_3_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_3_dest = DOUT_AG_fb_3_dest1
DIN_ME_fb_1_3_clk  = DOUT_AG_fb_3_clk1
DIN_ME_fb_1_4_data = DOUT_AG_fb_4_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_4_dest = DOUT_AG_fb_4_dest1
DIN_ME_fb_1_4_clk  = DOUT_AG_fb_4_clk1
DIN_ME_fb_1_5_data = DOUT_AG_fb_5_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_5_dest = DOUT_AG_fb_5_dest1
DIN_ME_fb_1_5_clk  = DOUT_AG_fb_5_clk1
DIN_ME_fb_1_6_data = DOUT_AG_fb_6_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_6_dest = DOUT_AG_fb_6_dest1
DIN_ME_fb_1_6_clk  = DOUT_AG_fb_6_clk1
DIN_ME_fb_1_7_data = DOUT_AG_fb_7_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_7_dest = DOUT_AG_fb_7_dest1
DIN_ME_fb_1_7_clk  = DOUT_AG_fb_7_clk1
DIN_ME_fb_1_8_data = DOUT_AG_fb_8_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_8_dest = DOUT_AG_fb_8_dest1
DIN_ME_fb_1_8_clk  = DOUT_AG_fb_8_clk1
DIN_ME_fb_1_9_data = DOUT_AG_fb_9_data[48, 36, 24, 12, 0]
DIN_ME_fb_1_9_dest = DOUT_AG_fb_9_dest1
DIN_ME_fb_1_9_clk  = DOUT_AG_fb_9_clk1

The DIN_ME data buses of memory controller #2 are
connected to bit 49, 37, 25, 13, and 1 of the DOUT_AG data buses of 9
aggregators, and so on. The DIN_ME data buses of memory controller
#12 are connected to bit 59, 47, 35, 23, and 11 of the DOUT_AG data
buses of 9 aggregators.
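
The bit-slicing pattern above is regular: memory controller #k takes every twelfth bit of the 60-bit DOUT_AG data bus, starting at bit k-1. The sketch below computes those bit positions and extracts the 5-bit slice from a bus word; the function names and the most-significant-bit-first ordering of the slice are assumptions for illustration.

```python
BUS_WIDTH = 60        # DOUT_AG data bus width in bits
NUM_CONTROLLERS = 12  # memory controller ASICs sharing the bus


def bits_for_controller(mc: int) -> list:
    """DOUT_AG bit positions (highest first) wired to 1-based controller mc."""
    if not 1 <= mc <= NUM_CONTROLLERS:
        raise ValueError("memory controller number must be 1..12")
    return [group * NUM_CONTROLLERS + (mc - 1) for group in (4, 3, 2, 1, 0)]


def slice_word(word: int, mc: int) -> int:
    """Extract the 5-bit slice of a 60-bit bus word seen by controller mc."""
    out = 0
    for bit in bits_for_controller(mc):
        out = (out << 1) | ((word >> bit) & 1)
    return out
```

Taken over all 12 controllers, the positions cover each of the 60 bus bits exactly once.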

12 memory controller ASICs aggregate cell/packet streams
from 8+1 aggregator ASICs, then write the cells into one of 200
output queues (e.g., 12 network blades x 4 channelized Poseidon
interfaces x 4 priorities for unicast + 4 priorities for multicast
+ 4 control port queues). The 8-bit destination queue number on
the DestID bus is used as the output queue indicator for the
unicast connection. The multicast cell is stored into one of 4
priority queues based on the 2-bit priority on the DestID bus. The
16 bit multicast connection number on the DestID bus will be used
to look up the internal port mask memory to find out the destination
blade and channels during the dequeue phase.

The memory controllers send out cell/packet traffic from
200 output queues to 8+1 separator ASICs. Dequeuing speed is
twice as fast as enqueuing speed to reduce the amount of cells buffered on
the switch fabric.

* Support both variable length packet switching and fixed length
  cell switching
* 12 ASICs are bit sliced and function as an integrated shared
  memory controller
* Support 40G, 80G, 120G, 160G, 240G, and 480G switch
  configurations
* Enqueue cells/packets from 9 aggregator ASICs
* 2x dequeue speedup to 9 separator ASICs
* On chip APS support
* 234, 057 cells on-chip buffer
* 200 programmable destination queues
* On-chip control port support
* 64K multicast connections, 2^32 unicast connections.
* Per-queue transmit and loss counts

Figure 16 shows memory controller ASIC architecture.

An 8Kx13 bit link list is used to maintain the free/used
memory entry list pointer. A free entry is requested from the free
link list when writing data into the shared memory and the current
tail cache line runs out of space. A complete cell/packet will be
dropped whenever the free list is empty, i.e., the shared memory is
full. A memory entry is freed to the free list after the memory
word is transmitted to the separator ASICs.

Figure 17 shows wide cache line shared memory
architecture.

DIN_ME_fb_se_9 and DOUT_ME_fb_se_9 buses are used to
connect to aggregator #9 and separator #9, which communicate with
the control port striper and unstriper ASICs only. They have the same
DestID and cell format as the other 8 buses do. Their cells are enqueued
and dequeued in the same way as the regular cells.

There are up to 4 additional control port queues. They
have queue ID from 192 to 195. All unicast connections having the
control port queue ID as their fabric queue ID are enqueued into the
relative control port queue. There are at most 4 OC-12 control
ports supported.

Each control port queue has a 13 bit control port
register as follows:

TABLE 22: 13 bit Control port queue register

Bit 12:5   8-bit regular port ID
Bit 4      Regular Port enable
Bit 3      Control Port 3 enable
Bit 2      Control Port 2 enable
Bit 1      Control Port 1 enable
Bit 0      Control Port 0 enable



A queue can be multicast to up to 4 physical control
ports and one regular queue. When a queue is redirected to the
regular queue, that queue must be disabled for the regular queue
traffic. Packets are queued in the same way as the regular queues,
i.e., 200 bit cache line based, left aligned every 36 cache
lines. Strict round-robin is applied among the 4 queues when a left
alignment entry is transmitted. A queue is routed to 4 control ports
and one regular port based on the 5-bit control port enable vector.
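
Decoding such a register is a matter of shifting and masking. The sketch below assumes the field layout of Table 22 as bits 12:5 for the regular port ID, bit 4 for the regular port enable, and bits 3:0 for control port 3 to 0 enables; that bit assignment, the function name, and the dictionary keys are assumptions for illustration.

```python
def decode_cp_register(reg: int) -> dict:
    """Split a 13-bit control port queue register into its assumed fields."""
    if not 0 <= reg < (1 << 13):
        raise ValueError("register value must fit in 13 bits")
    return {
        "regular_port_id": (reg >> 5) & 0xFF,          # bits 12:5
        "regular_port_enable": bool((reg >> 4) & 1),   # bit 4
        # index i holds the enable for control port i (bits 0..3)
        "control_port_enable": [bool((reg >> i) & 1) for i in range(4)],
    }
```

The low five bits together form the 5-bit enable vector referred to above.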

Two dequeue algorithms are applied among 4 control port
queues:

a) One control port only talks to one cp queue: Pure round
   robin dequeue among 4 non empty control port queues which
   have non zero unicast tokens; one token worth of unicast (up
   to 200-bit) is sent out to dout_me bus for a port;

b) One control port talks to multicast cp queues: Strict
   priority among 4 control port queues; queue 192 has
   highest priority and queue 195 has lowest; switch queues
   when the end of the packet is seen.

OAM cells are identified by the Fabric queue ID field. If
this field of a unicast connection has value 0xFx(h), then it is an
OAM cell. All OAM cells can be mapped into one of the 192 blade or
4 control port queues set by an 8 bit programmable register (called
OAM cell destination register).

Resync cell (0xFF) or any other special cells with fabric
queue ID set to 0xFx are routed to any one of 196 queues based on
the OAM cell destination register too.

Per destination minimum and maximum thresholds and counts
can be set up to help memory management. 200x2x14-bit thresholds
(in unit of 200 bit entry) and 200 x 13-bit running counters (in
unit of 200 bit entry) are provided. Two additional per
destination transmit and loss counts (32 bit each, in unit of
packets) are also maintained. If the running count of a destination
is above the relative threshold, new packets are rejected and the loss
count increments. Whenever dropping, the whole packet is dropped.
Otherwise, the transmit count increments. For multicast
connections, cells can also be rejected because the multicast route
word FIFO is full. 4 additional FIFO full counts are needed. If a
packet is dropped, the whole packet is cleaned from the memory
(including the segments of a long packet). The thresholds and
current counts are in unit of 200 bit cache lines.



The minimum threshold (13 bit value plus 1-bit enable
bit) is used to prevent shared memory starvation, i.e., every queue
reserves at least the number of cache lines indicated by the
threshold. The maximum threshold (13 bit value plus 1 bit enable
bit) is used to prevent any single queue consuming the whole shared
memory. These two thresholds cannot be changed unless there are no
packets in the queues.
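
The maximum-threshold check and the transmit/loss counts can be modelled roughly as follows. This is a simplified sketch: it covers only the maximum threshold, counts a packet as transmitted at admission time, and uses hypothetical class and field names; the minimum-threshold reservation and the multicast FIFO-full case are not modelled.

```python
from dataclasses import dataclass


@dataclass
class QueueCounters:
    """Hypothetical per-destination counters, in units of cache lines/packets."""
    max_threshold: int          # maximum threshold, in cache lines
    max_enabled: bool = True    # the 1-bit enable that accompanies the threshold
    running: int = 0            # running count of cache lines held by this queue
    transmit_count: int = 0     # packets accepted
    loss_count: int = 0         # packets rejected

    def admit(self, packet_lines: int) -> bool:
        """Accept or reject a whole packet that occupies packet_lines cache lines."""
        if self.max_enabled and self.running > self.max_threshold:
            self.loss_count += 1       # the whole packet is dropped
            return False
        self.running += packet_lines
        self.transmit_count += 1
        return True
```

Because the comparison is made before the new packet is counted, a queue can overshoot its threshold by one packet, after which further packets are rejected until it drains.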

All counters are 32-bit wide. They are reset to zero
automatically after reading. Their values stick to 0xFFFFFFFF if
overflowed. It takes 2^32 x 8ns, approximately 34 seconds, to
overflow a counter in the worst case.

The value of any threshold registers can be updated on the
fly by a resync cell or a shadow control cell. The content of the
32 bit shadow data register is copied to the location pointed to by
the shadow address register.

The memory controller can enqueue a single OC-192 data
stream from the aggregator ASIC and dequeue a single OC-192 data
stream to the separator ASIC instead of 4xOC-48 streams. At the
ingress side, the ASIC receives 4 continuous cells/packets/cache
lines from the same source channel instead of 4 channels. No
special treatment is needed.

At the egress side, the Queue Drainer reads 4 cache lines
from the shared memory for one destination after a token command is
received for the OC-192 port. The ROD can send up to 4 200 bit
cache lines to the separator from the same destination queue. Each
OC-192 port has 4 priorities for all switch configurations.

The separator ASICs receive cell/packet streams from 12
memory controllers, separate them, and send them to up to 48 network
blades through the backplanes. The interfaces between the separator and
the backplane are 250MHz point to point HSTL signals.

Figure 18 shows the Separator ASIC architecture.

* Receive 12 data streams from 12 memory controllers
* Fabric synchronization
* 24-destination (blades and channels) addressing
* Route word separation and aggregation
* 0.25um 3V CMOS technology
* 410 I/O pins: 140 bit 200MHz input; 240 bit 250MHz output (at
  most 120 of them switch simultaneously); 30 bit control signals

The separator has twice the number of data output pins as
that of the aggregator ASIC to support 2X speedup. Similar to those
of the striper ASIC, the ASIC supports 40G, 80G, 120G, 160G, 240G,
and 480G switch configurations without backplane change.

The separator ASIC performs the reverse function of the
aggregator ASIC. The ASIC receives a 120 bit 250MHz cell/packet
stream from one of 8 DOUT_ME_fb_se_bu buses of every memory
controller (12 of them). 10 bit blade and channel selection signals
are used to select one of 24 destinations inside each separator for
up to two cells. For example, the DIN_GF buses of separator ASIC #1
are connected as follows:

CH_0P_fb_1 = CH_ME_fb_1

When a valid cell/packet (channel ID is in the range of
0-23) is received, the packet type field in the route word is
checked first. If it is an ATM cell, no packet length field
follows. The length of cell payload is 36x12/number of fabrics. If
it is a packet, the packet length bit that immediately follows is used
to indicate how long the packet length field is: 0 for a 12-bit packet
length (including this bit) and 1 for a 24-bit packet length (including
this bit). The entire packet/cell is routed to the destination channel
indicated by the channel ID. The invalid channel ID (bigger than
24) is used to indicate that the cell/packet is invalid.




The ASIC then separates the route word and the payload
onto the route word bus and the data bus of one of 6 blades and 4
destination channels/unstriper ASICs based on the channel ID
signals. One 250MHz 24 bit data bus yields 6Gbps data bandwidth for
each channel. Each route word is 2 bit wide running at 250MHz.

The connectivity between the separator ASICs and the
unstriper ASICs is symmetric to that between the aggregator ASICs
and the striper ASICs. The only difference is that all data and
route word pins have double width to achieve 2X speedup.

Data received from each destination of each memory
controller is accompanied by a 1-bit valid bit. 24
destination input FIFOs are used to store the 12 pieces of
cell/packets from 12 memory controllers for 24 destination blades
and channels in each separator, respectively. When all 12 cell
segments arrive, the complete cell is sent to the relative output
FIFO indicated by the channel ID.

Like the striper ASIC, a 3-bit sequence number counter is
maintained for the backplane synchronization. It increments every
36 250MHz cycles. When a cell is sent to the unstriper ASICs via
the backplane, the current counter is attached into the sequence
number field in the 36-bit route word.

The sequence number counter is reset by the global
resynchronization logic.



The unstriper ASIC takes 6Gbps traffic from up to 12+1
switch fabrics. It then unstripes the cell and sends it to the
egress netmod ASIC at SGbps or lower speed.

* Receive 6Gbps route word and data from up to 12+1 fabrics at
  250MHz for OC48 or combine 4 chips to support 24 Gbps
  routeword and data from up to 12+1 fabrics for OC192c
* Error check data transport throughout the switch, detect
  corrupted data and perform data recovery
* Reconstructs cells/packets from the individual switch fabrics.
* Send 64 bit 100MHz data to the egress port ASIC for OC48, 256
  bit for OC192c
* Supports both UC and MC connection context for fabric data.

Figure 19 shows the unstriper ASIC Architecture.

Ffre — unstriper — AGIC — receives — cells — from — trp — to — 12 I 1 

fabrics, — each — running — srfc — 250MIIz . — ft — uses — the — following — steps — to 
reconstruct good data. 

ir-. — ffl: — incoming — routewords — srre — compared. — ££ — any — one — routeword 
15 disagrees , — that — data — lane — rs — flagged — a-s — being — ±rr — error . — If more 
than one routeword disagrees, — the data is dropped. 

2. All valid input lanes are put through reconstruction logic which 
will attempt to build n i l candidate output data streams — for an N 
fabric switch. Any data lane which is not valid will invalidate any 
2 0 data lane which uses that data. 



th — £rt-k — valid — reconstruction — lanes — will — check — the — &ft€ — of — t-he- 
received data and one passing output is selected. 
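
The routeword comparison in step 1 can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `check_routewords`, the use of a majority vote to pick the reference routeword, and the string status values are hypothetical and are not taken from the text, which only states that one disagreeing routeword flags its lane and that more than one causes the data to be dropped.

```python
from collections import Counter


def check_routewords(routewords):
    """Compare the routewords received from each fabric lane.

    Assumption (not from the text): the reference routeword is the
    most common value among the lanes.

    Returns (status, bad_lanes) where status is one of
    "ok", "flagged" (exactly one lane disagrees) or "dropped"
    (more than one lane disagrees).
    """
    reference, _ = Counter(routewords).most_common(1)[0]
    bad_lanes = [i for i, rw in enumerate(routewords) if rw != reference]
    if not bad_lanes:
        return "ok", []
    if len(bad_lanes) == 1:
        # One lane in error: flag it so reconstruction can avoid it.
        return "flagged", bad_lanes
    # More than one disagreement: the data cannot be trusted.
    return "dropped", bad_lanes
```

For example, four lanes carrying routewords 5, 5, 7, 5 would flag lane 2 only, while 5, 6, 7, 5 would cause a drop.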
The striper remaps the separate routeword and data buses to a
combined outgoing routeword and data bus.

The following will detail the steps which happen at power up from
an architectural perspective. Note that when expanding switch
capacity, the additional fabrics must be brought on-line before any
new port cards are brought on-line.

Fabric Initialization

1. Port cards (unstripers) are initialized to only look at current
fabric capacity and ignore other fabric inputs.

2. Fabric is inserted, asserts its board present signal. Stripers
start sending routewords to the new fabrics, though they are
ignored at this point.

3. Board is reset, MCP starts to boot the board. Before proceeding
to the next step, the MCP/SCP establish communication via the enet
network.

4. If the board is fabric 0 or the parity fabric, the sync pulse
transmitter is initialized. (Actually sync pulse transmitter can be
initialized on all fabrics, but it is only connected to DP signals
if it is fabric 0 or the parity fabric.)

5. MP initializes sync registers in the aggregator, memory
controller, and separator, then initializes the registers in the
sync pulse receiver. The sync pulse receiver starts to look for a
valid sync pulse. The last sync setup is the sync pulse receiver,
so that all receivers on the chips are ready for the sync pulse
from the sync pulse receiver. The fabric chips run chip-chip sync
on the next backplane sync pulse. The MP should check to make sure
the fabric has synchronized. If sync has not been achieved, reset
the fabric chips and re-execute step 4.

6. SCP tells MP the current switch capacity window to use. This is
actually going to correspond to the current switch capacity (does
not count the capacity of the new fabric if switch capacity is
being expanded).

7. MP initializes the backplane transceiver networks with the
current switch capacity (both send and receive) and initializes all
registers except the aggregator input enables. Any values used for
configurable options (which ports are OC48/OC192, memory
thresholds, etc.) need to be communicated and initialized at this
point. Certain registers are initialized based on the switch board
slot, which needs to be known at this point. From a software
perspective, the biggest register set which must be done is to
update the port mask table in the memory controllers to match the
port mask table from another switch fabric.

8. Aggregator input enables are set for the current switch
capacity. This will start enqueueing traffic on this switch board.
The aggregators will need to see a bus idle followed by an
increment in the transmit sequence number before starting to
actually receive data.

9. SCP sends a queue resync cell. On cell return, fabric queues are
now synchronized. However, no valid data is being enqueued in the
new fabric(s) and the fabric outputs are being ignored.

10. All unstripers must be configured to start utilizing the new
fabric. Since queues have been resynchronized, the fabric dequeuing
should be synchronized and no errors should be seen. If errors are
seen, clear them, return to step 8.

11. After all unstripers have been updated, SCP tells all port card
MCPs to update stripe amount inside each of the striper ASICs. The
change in striper configuration will start the switch utilizing the
additional capacity.

12. After all stripe amounts are updated and traffic from the
previous stripe amount drained from the switch, then the switch
capacity needs to be updated. The only fixed time bound way to
ensure traffic from the previous stripe amount is flushed is to
execute a queue resync. If not all traffic has been flushed from
the system with the previous stripe amount, the switch will drop
this traffic at the unstripers (since there is no synchronization
of the update at the separators, the drop cannot be performed
there).

Before a port card is brought on-line, any necessary switch fabrics
must be brought on-line first. As per the switch standard
convention, port card installation happens in order.

1a. The starting state has sufficient switch capacity to support
the new port card. Aggregators are currently configured to ignore
the input from any new board.

1b. Port card is inserted and asserts its board present signal.
Port card sees sync pattern received from the fabrics.

2. The sync pulse receiver is initialized. The port card starts
looking for a valid sync pulse on the backplane.

4. Striper transmitter is set up for the appropriate number of
destination fabrics and the Gbit network control is initialized.
Before the Gbit networks are initialized, the fabrics cannot count
on seeing idle data from the new port card. At this point, the port
card can communicate its type (OC48/OC192) to the fabrics.

5a. Fabrics configure the port card type and enable the input from
the port card.

5b. Striper/unstriper are now initialized, along with the other
chips on the board. Some enable in the inbound data path should be
disabled. The BIB input enable in the striper can be used or some
other board specific input enable.

6. After both 5a and 5b have been completed, the port card can
enable its input side and start sending data to the fabrics. Note
that in general, further software configuration will need to be
done after this point (such as setting up inbound lookup entries).
The completion of 5a is necessary to ensure the fabric queues do
not go out of sync.

7. First data from the port card is striped to all fabrics.

1. When a port card is removed from the system, not very much needs
to happen from a hardware perspective. Before the port card goes
away, it transmits a packet abort which will cause any incomplete
packets in the egress side to be dropped. Traffic will be drained
from the memory queues which correspond to the affected output
ports.

2. To remove a port card from the switch logically, software should
disable the striper output bus.

Fabric deactivation is similar to fabric activation in reverse. The
steps include:

1. Switch capacity is being removed. If port cards are present in
the switch which are paired with the fabric capacity which is about
to be removed, those must first be deactivated.

2. Program the remaining stripers in the system to stripe data to
one less stripe amount than the current configuration. This will
stop sending real data to the fabric about to be decommissioned.

3. Send a queue resynch. This will flush out any traffic at the
last stripe amount.

4. Program the unstripers to start ignoring the data from the
fabric which is about to be removed.

5. The fabric can now be physically removed from the system, or
logically removed from the system by disabling its inputs and
outputs.

The reason for the queue resynch step is not because the switch is
out of sync. The unstriper will treat the receipt of traffic which
is striped to more fabrics than physically present in the switch as
an error and increment error counts. The queue resynch ensures that
the error counts on the unstripers will not increment
unnecessarily.

1. Flush out traffic from the port to be converted over to APS.
Initialize anything in the separator as required for the new output
port combination.

2. Write to the APS enable bit using the shadow register in every
memory controller for the output port being affected. The main port
for APS is not affected. Either a higher or lower number port can
be the primary port and the backup port. APS is always enabled on
the backup port.

3. Send either a queue resync cell or a shadow control cell to all
memory controllers.

4. Memory controllers start to dequeue after the next left aligned
cache boundary (if the previous transfer for this port was left
aligned, it will be remembered).

Note that in all this process, the queue number was never switched.
The switch will not support a seamless port swap due to APS
activate/deactivate. (In other words, APS can be turned on port 0,
which will cause port 0 to mirror port 16. However, APS cannot be
turned off on port 16 since it is not on. Traffic is only being
changed for the port where APS is added.)

The following words have reasonably specific meanings in the
vocabulary of the switch. Many are mentioned elsewhere, but this is
an attempt to bring them together in one place with definitions.

TABLE 23:

Word              Meaning

APS               Automatic Protection Switching. A SONET/SDH
                  standard for implementing redundancy on physical
                  links. For the switch, APS is used to also recover
                  from any detected port card failures.

Backplane synch   A generic term referring either to the general
                  process the switch boards use to account for
                  varying transport delays between boards and clock
                  drift or to the logic which implements the TX/RX
                  functionality required for the switch ASICs to
                  account for varying transport delays and clock
                  drifts.

BIB               The switch input bus. The bus which is used to pass
                  data to the striper(s). See also BOB.

Blade             Another term used for a port card. References to
                  blades should have been eliminated from this
                  document, but some may persist.

BOB               The switch output bus. The output bus from the
                  striper which connects to the egress memory
                  controller. See also BIB.

Egress Routeword  This is the routeword which is supplied to the chip
                  after the unstriper. From an internal chipset
                  perspective, the egress routeword is treated as
                  data. See also fabric routeword.

Fabric Routeword  Routeword used by the fabric to determine the
                  output queue. This routeword is not passed outside
                  the unstriper. A significant portion of this
                  routeword is blown away in the fabrics.

Freeze            Having logic maintain its values during lock-down
                  cycles.

Lock-down         Period of time where the fabric effectively stops
                  performing any work to compensate for clock drift.
                  If the backplane synchronization logic determines
                  that a fabric is 8 clock cycles fast, the fabric
                  will lock down for 8 clocks.

Queue Resynch     A queue resynch is a series of steps executed to
                  ensure that the logical state of all fabric queues
                  for all ports is identical at one logical point in
                  time. Queue resynch is not tied to backplane
                  resynch (including lock-down) in any fashion,
                  except that a lock-down can occur during a queue
                  resynch.

SIB               Striped input bus. A largely obsolete term used to
                  describe the output bus from the striper and input
                  bus to the aggregator.

SOB               One of two meanings. The first is striped output
                  bus, which is the output bus of the fabric and the
                  input bus of the agg. See also SIB. The second
                  meaning is a generic term used to describe
                  engineers who left Marconi to form/work for a
                  start-up after starting the switch design.

Sync              Depends heavily on context. Related terms are queue
                  resynch, lock-down, freeze, and backplane sync.

Wacking           The implicit bit steering which occurs in the OC192
                  ingress stage since data is bit interleaved among
                  stripers. This bit steering is reversed by the
                  aggregators.

The Aggregator Receive Synchronizer's function is to maintain
logical cell/packet ordering across all fabrics. Cells/packets
arriving at more than one fabric from different port cards need to
be processed in the same logical order across all fabrics. If
cell/packet logical ordering is not maintained, then cells/packets
coming out of fabrics will have stripes of a particular cell/packet
not match up and will not be able to be re-assembled by the
Unstriper.

Logical cell/packet ordering needs to be maintained across the
following conditions:

* Transport delay variances between one source and multiple
destinations

* Clock drift across transmitters and receivers

* Insertion and removal of port cards and fabrics

* Port card errors such as no sync, no lock downs, too fast/too
slow, routeword parity errors

* Gigabit transceiver errors such as loss of lock, data errors

* Non synchronized updates to Gigabit network

* OC192c data streams (aggregating 4 channels to make up one OC192c
stream)

The switch uses a system of transmit and receive counters. The
counters allow all components in the system to logically align
themselves. The Master Sequence Generator implements these two
counters that will count continuously from '0' to [illegible] and
will increment every x 125 MHz clock cycles, where x is the counter
tick length as programmed by software. x is currently calculated to
be 250 cycles. This is based on analysis done in the Backplane
Synchronization ADS. The relationship between the transmit and
receive counters can be seen in Figure 20. One counter will be used
by the transmit synchronizers in the Striper and Separator ASICs
and the other counter will be used in the receive synchronizers in
the Aggregator and Unstriper ASICs. The receive counter will be a
delayed version of the transmit counter. The amount of delay is
programmed by software in the Sync Pulse Receive Delay register.
This register determines the number of clock cycles that the
receive counter waits before incrementing its own counter relative
to the transmit counter. This register should always be non zero
since the transmitter will have no delay and the receiver needs to
be delayed with respect to the transmitter. The Sync Pulse Receive
Delay has been estimated to be 150 cycles. The delay is
approximately equal to the worst case transport delay between
transmitter and receiver plus worst case transport delay variance
of the sync pulse. The delay also takes into account worst case
fast and slow transmitters and receivers.
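
The relationship between the transmit and receive counters can be sketched as follows. This is a minimal model, assuming both counters are driven by the same ideal clock; the function name `sequence_count` is hypothetical, and the `modulus` parameter is a placeholder because the maximum counter value is not legible in the text. The tick length of 250 cycles and the delay of 150 cycles are the estimates quoted above.

```python
def sequence_count(cycle, tick_length=250, modulus=16, delay=0):
    """Return the sequence count seen at a given clock cycle.

    cycle       -- clock cycle index, starting at 0
    tick_length -- cycles per counter increment (x in the text)
    modulus     -- number of counter values (placeholder assumption)
    delay       -- Sync Pulse Receive Delay; 0 for the transmit counter

    Returns None while a delayed (receive) counter has not started yet.
    """
    if cycle < delay:
        return None
    return ((cycle - delay) // tick_length) % modulus


# Transmit counter: no delay. Receive counter: same count, 150 cycles later.
tx_at_250 = sequence_count(250)             # transmit counter just ticked
rx_at_250 = sequence_count(250, delay=150)  # receive counter still one behind
```

With these values the receive counter reaches each count exactly 150 cycles after the transmit counter does, which is the "delayed version" behaviour described in the paragraph.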

The Sync Pulse Period is defined as the number of cycles between
sync pulses. It is extended slightly by about 10 cycles in order
for it to appear late in the [illegible] window of each ASIC's
sequence count. This is done to ensure that every ASIC will appear
to be running fast even if they are actually running slow relative
to the clock that generated the sync pulse. If this was not done,
the sync pulse could appear in the [illegible] window and the ASIC
would consider itself to be slow. There would be no way for it to
catch up. Each transmitter and receiver will calculate the
difference between when the sync pulse arrives and when its own
counter transitions from [illegible] to '0'. This difference is the
number of cycles that it is fast and is referred to as the
lock-down amount (z in figure). Once a transmitter determines it
should lock-down for z cycles, it will finish sending valid data
during its '0' window and then lock down z cycles. During the
lock-down period, no valid or idle data is sent. Instead, a special
lock-down K character is transmitted which will be recognized by
the receiver. The receiver will not write the lock-down characters
into its input FIFOs. This will ensure that the input FIFOs can't
overflow. Since the sequence counter does not advance for the
amount of lock-down, it is effectively resetting itself to the sync
pulse. It is equivalent to having the sync pulse appear at the
start of the '0' count window, since the transition to a count of
'1' occurs precisely one tick length after the sync pulse arrives.
When the next sync pulse arrives, if clock frequencies are
constant, then the sync pulse should appear in the '0' count window
and the calculated lock-down amount will be the same as the
previous calculation. This allows the system to always expect the
sync pulse arrival in the '0' count window even if the clocks
generating the sequence counter are too fast or too slow.
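
The lock-down calculation can be sketched as below. This is an illustration only: the function names are hypothetical, cycle numbers are counted on the local clock, and the sketch assumes (as the paragraph implies) that the sync pulse always arrives at or after the local counter's transition to '0'.

```python
def lock_down_amount(rollover_cycle, sync_pulse_cycle):
    """Cycles the local counter is fast (z in the text).

    rollover_cycle   -- local cycle at which the counter transitioned to '0'
    sync_pulse_cycle -- local cycle at which the sync pulse arrived
    """
    z = sync_pulse_cycle - rollover_cycle
    if z < 0:
        # The pulse arrived before the rollover: the ASIC would look slow,
        # which the extended sync pulse period is meant to prevent.
        raise ValueError("sync pulse arrived before counter rollover")
    return z


def next_transition_cycle(rollover_cycle, sync_pulse_cycle, tick_length=250):
    """Local cycle of the next count transition after locking down.

    The counter holds for z cycles, so the '0' window lasts
    tick_length + z cycles.
    """
    z = lock_down_amount(rollover_cycle, sync_pulse_cycle)
    return rollover_cycle + tick_length + z
```

If the counter rolls over at cycle 1000 and the sync pulse arrives at cycle 1008, the lock-down amount is 8 and the next transition lands at cycle 1258, exactly one tick length (250 cycles) after the sync pulse.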

The Receive Synchronizer block will use the sequence counter to
determine when to accept data from input byte sync FIFOs. Once a
sync character is read, pops from the FIFOs will only occur once
the sequence counter transitions from "0" to "1" and immediately
following the arrival of a sync pulse. The read decision is only
made once every sync pulse arrival and only at the "0" to "1"
transition of the receive sequence counter. The sequence counter is
also used during fabric resync in order to communicate a fabric
resync to all channels in all aggregators during a sequence count
transition. Fabric resync cells will be transmitted at the
beginning of a sequence tick window and are prefixed by a special
character indicating a resync cell. The receive synchronizers in
the Aggregators will resynchronize all data going to the memory
controllers on the next sequence count transition once the resync
character has been received.

A block diagram of the Receive Synchronizer can be seen in Figure
21. The Receive Synchronizer consists of 24 byte sync FIFOs, a
Crossbar and G Bus Synchronizers. There is one byte sync FIFO per
gigabit receiver. Each byte sync FIFO will accept data from each
gigabit receiver independent of the mode of the switch. The byte
sync FIFO is about 25G words deep. This depth is based on a
derivation found in the Backplane Synchronizer ADS. The Crossbar
will handle the assignment of the appropriate input byte lanes to
the correct channels. Each Bus Synchronizer will consist of four
Channel FIFOs and one Bus Controller. The Bus Controller can handle
4 separate OC48 channels or one OC192c stream. The channel FIFO is
about 1-8 words deep. The depth is based on the number of words to
read a 30 bit routeword. The whole routeword is read and then
presented to the rest of the Aggregator in one cycle since it needs
to be stored before the data of the packet as it is constructed and
sent to the memory controller.

Multiple gigabit receivers make up a 24 bit data bus and 2-bit
routeword bus for one channel of an Aggregator. Each gigabit
receiver can handle up to 8 bits. Due to varying transport delays
that can exist between receivers, bytes from different receivers
that belong to the same word can be skewed from each other. For
example, the 24 bit data bus and 2-bit routeword bus for one
channel of an aggregator will have 4 receivers that make up the
bus. The synchronization logic will align all 4 bytes for the 26
bit bus and will pass this byte aligned word to the rest of the
Aggregator. In order to align the bytes, the Striper will need to
send a special alignment byte to each receiver. A special K
character can be utilized from the gigabit transceivers. The K
character will be encoded in the data bits on the Gigabit
transmitter and will be detected on the Gigabit receiver.

The receive synchronizer in the Aggregator will consist of 24 FIFOs
where there is one FIFO per Gigabit Receiver. These FIFOs will
handle both byte alignment and the backplane synchronisation. It is
assumed that the Gigabit Receivers will be able to distinguish
between valid, idle, sync and lock-down cycles and will indicate
these various cycles to the Aggregator by using 3 control signals.
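
Byte alignment across skewed receivers can be pictured with the following sketch. It is a simplified software model, not the hardware logic: the marker value `ALIGN`, the function name `align_lanes`, and the representation of each lane as a Python list are assumptions made for illustration.

```python
ALIGN = "K"  # stand-in for the special alignment K character


def align_lanes(lanes):
    """Line up bytes from several receivers into whole words.

    Each lane is the sequence of bytes seen by one gigabit receiver.
    Because of transport delay, the alignment character shows up at a
    different position in each lane. Everything up to and including the
    marker is discarded, and the bytes that follow are grouped
    position by position into words.

    Raises ValueError if a lane never received the marker.
    """
    starts = [lane.index(ALIGN) + 1 for lane in lanes]
    aligned = [lane[start:] for lane, start in zip(lanes, starts)]
    # zip stops at the shortest lane, so only complete words are returned.
    return list(zip(*aligned))
```

For three lanes skewed by zero, one and two positions, the bytes that were sent together come back out together, regardless of when each lane saw its marker.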

On startup, the FIFOs will be empty and each Write State Machine
(WSM) will wait until a sync character is seen on its input. From
this point on, every cycle will be pushed except for lock-down
cycles from the fabric. When the fabric is locking down, the
Stripers will send special lock-down characters. This is done to
avoid overflowing the sync FIFOs in case the write side clock is
faster than the read side clock. While particular types of words
are being pushed, the word type will also be written to the FIFO so
it can be distinguished on the read side.
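
The write-side behaviour just described can be modelled in a few lines. This is a sketch under stated assumptions: the class and enum names are hypothetical, the FIFO is an unbounded `deque` rather than a fixed-depth hardware FIFO, and the sync character itself is assumed to be pushed (the read side is later described as checking that the first word is a sync character).

```python
from collections import deque
from enum import Enum


class WordType(Enum):
    VALID = 1
    IDLE = 2
    SYNC = 3
    LOCK_DOWN = 4


class WriteStateMachine:
    """Software model of one byte sync FIFO's write side."""

    def __init__(self):
        self.fifo = deque()
        self.synced = False  # becomes True once the first sync is seen

    def write(self, word_type, data=None):
        """Offer one input cycle; return True if it was pushed."""
        if not self.synced:
            if word_type is not WordType.SYNC:
                return False  # still waiting for the first sync character
            self.synced = True
        if word_type is WordType.LOCK_DOWN:
            return False  # lock-down cycles are never written
        # The word type is stored alongside the data for the read side.
        self.fifo.append((word_type, data))
        return True
```

Skipping lock-down cycles is what keeps the FIFO from filling when the write clock runs faster than the read clock.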

The WSM is also looking for a special fabric resync cell K
character that will indicate that a fabric queue resync cell will
immediately follow. If a resync cell is detected, a resync signal
is passed along to the Bus Controller. The Bus Controller will then
tell other Aggregators on the fabric to resync their queues at the
next transition of the sequence counter. Fabric queue resync is
described in more detail later.

Gigabit receivers are not dedicated to particular input channels,
but instead shared between various channels. Each byte sync FIFO
works independently of the switch mode and each input lane needs to
be steered to the correct channel FIFO. For instance, in 40 mode,
26 bits of data and routeword are required for Bus 1, channel A and
therefore 4 byte lanes are required to be steered to each channel
of Bus 1. In 00/120 mode, only 0 bits of data and 2 bits of
routeword are required and therefore two bytes will suffice. In 400
mode, only 4 bits are required per channel and one byte lane will
suffice. As switch capacity increases, fewer and fewer byte lanes
will be required for a particular channel. For all switch modes,
the routeword bits for a particular channel will always come from
the same byte lane. As the byte lanes get reduced from 4 to 1 byte
lanes, there will always be one common byte lane used to carry the
routeword data lines. The Crossbar will take in 24 lanes consisting
of 8 bits of data and 3 bits of control along with other control
signals to communicate with the Bus Control logic. It will then
forward all these signals to the appropriate channels. The Crossbar
will also accept control data from the Bus Controller and forward
signals such as read requests and FIFO flush signals to the
appropriate input byte sync FIFOs. Each crossbar mapping between
input byte lanes and channels is bi-directional.

The Bus Controller consists of three state machines. The state
machines control the read side of the byte sync FIFOs, the write
side of the channel FIFOs and the read side of the channel FIFOs.
On the read side of the byte sync FIFOs, pops will not commence
until a sync pulse has arrived and the receive sequence counter has
transitioned from "0" to "1". A signal will be provided from the
sequence generator block that indicates a "0" to "1" transition at
precisely this moment (sync_event). At this time, the Bus
Controller issues a read to the Crossbar for the particular
channel. The Crossbar then forwards the read signal to the
appropriate byte sync FIFOs based on the mode of the switch. The
Crossbar then forwards all data and control from these byte sync
FIFOs back to the Bus Controller for this channel. The Bus
Controller checks the data types to make sure that the first word
in each of the appropriate byte sync FIFOs is a sync character. If
the first word of any of the appropriate byte lanes for this
channel is not a sync character, then a sync error will be flagged,
the appropriate byte sync FIFOs will be flushed and the
synchronization process will be re-initiated. If the first word is
a sync character, then pops will continue. In OC48 mode, this
process will be performed independently for each channel. OC192c
support is discussed later on.

Once data starts being read from byte sync FIFOs, the Bus
Controller will ignore data until it finds the first idle word.
Once an idle word has been found, it can now start looking for the
SOP indication in the routeword when the next non idle word is
read. The rest of the routeword is processed and made available to
the rest of the Aggregator. If the stop bit in the routeword
indicates that the packet is continuing, then data will be
continuously made available to the Aggregator until a stop
indication is read. Note that even though an SOP is seen, it does
not mean that this segment is the first segment of a packet. It can
be any segment of a packet. Even though the segment may not be the
first one of a packet, it is allowed to go through the switch and
will be dropped later on.
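
The "ignore data until the first idle word" rule can be sketched as a simple filter. This is a deliberately reduced model: the function name and the `("idle", None)` / `("data", payload)` tuples are assumptions, and the SOP and stop-bit handling described above is not modelled.

```python
def words_after_first_idle(words):
    """Return the payloads that the Bus Controller would pass on.

    words is an iterable of (kind, payload) tuples where kind is
    "idle" or "data". Data words seen before the first idle word are
    ignored, because they may belong to a packet whose start was
    never observed.
    """
    seen_idle = False
    passed = []
    for kind, payload in words:
        if kind == "idle":
            seen_idle = True
        elif seen_idle:
            passed.append(payload)
    return passed
```

In this model a data word that arrives before any idle word is silently discarded, while everything after the first idle word is forwarded.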

When a sync character is read, a counter is initialised. The
counter counts each read from the byte sync FIFOs. The Bus
Controller will expect to see a sync character every sync pulse
period (about 22,000 cycles). If a sync character is read too early
or too late, then a sync error is flagged and data is dropped at
the precise logical cycle where a sync character is expected. A
packet that is being processed at the theoretical logical cycle for
sync will be terminated and inputs will be disabled until
re-enabled by S/W. For example, if after the first sync character,
the next sync character occurs at cycle 19,000, then a sync error
is flagged. Data is not dropped until 22,000 reads have been
performed. Also, if after the first sync character, the next sync
character is not received at all after 22,000 cycles, then a sync
error is flagged and data is dropped at this precise logical cycle.
If a sync character is received precisely 22,000 cycles after the
last one, then reads from the byte sync FIFOs are stopped until the
receive sequence counter transitions from "0" to "1". Waiting for
the "0" to "1" transition will ensure that all fabrics are
receiving the same stripe of a packet on the same logical cycle.
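
The sync period check can be modelled as a read counter. The sketch below is an assumption-laden illustration: the class name, method name and return strings are hypothetical, the period of 22,000 is the approximate figure quoted above, and the sketch only classifies each read. It does not model deferring the drop to the expected cycle or disabling the inputs.

```python
class SyncChecker:
    """Counts reads since the last sync character and flags timing errors."""

    def __init__(self, period=22_000):
        self.period = period
        self.count = 0       # reads since the last good sync character
        self.error = False   # sticky sync error flag

    def on_read(self, is_sync):
        """Call once per read from the byte sync FIFO.

        Returns "sync_ok" for a sync character on the expected read,
        "sync_error" for an early sync or a missing/late sync,
        and "ok" for any other read.
        """
        self.count += 1
        if is_sync:
            if self.count == self.period:
                self.count = 0
                return "sync_ok"
            self.error = True  # sync character arrived too early
            return "sync_error"
        if self.count >= self.period:
            self.error = True  # sync character late or missing
            return "sync_error"
        return "ok"
```

Using a small period such as 5 makes the three cases easy to exercise: a sync on the fifth read is accepted, a sync on the second read is an early error, and five reads without a sync is a missing-sync error.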

For OC192c, 4 input channels need to be concatenated into one
OC192c stream. In this mode, the Bus Controller will control all 4
channel FIFOs and the appropriate byte sync FIFOs. Data type
checking will be performed across 4 times as many byte lanes as in
the OC48 case. When it is time to read byte sync FIFOs, the Bus
Controller will control 4 read control lines to the Crossbar. The
Crossbar will initiate reads across all appropriate byte sync FIFOs
that are required for OC192c and will present data back to the Bus
Controller. The Bus Controller will check data types and will look
for SOP indications. The SOP indication and stop bits will only be
found in the Routeword for channel A. The Bus Controller will write
all 4 channel FIFOs at the same time when writing data and will
present the complete OC192c Routeword in one cycle to the rest of
the Aggregator. The functions of the Bus Controller will be
identical for OC48 and OC192c except that all 4 channel FIFOs will
be controlled when in OC192c mode.

Special cases can be broken down into the following categories:

0. Port card insertion

1. Port card removal

2. Port card errors including:

a. No sync character

b. Port card not locking down

c. Routeword parity errors

d. Garbage data

e. Port card sending data too fast or too slow

3. Fabric Queue resync

4. Non synchronized updates to Gigabit network

When a port card is inserted, the port card present signal will be
asserted and sent to each fabric. Not until S/W enables the
particular inputs and the Aggregator sees the port card present
signal, will the Aggregator be ready to accept data from the new
port card. Once enabled, the Aggregator will go through the process
of looking for sync characters on individual byte lanes associated
with the new port card. It is assumed that the port card will not
send any data until it has been configured only after the fabrics
have been initialized. Once the port cards are enabled, they will
start sending sync characters periodically at every global sync
pulse arrival. It is important that all the appropriate fabrics see
the sync character from the particular port card since some fabrics
will be initialized later than others. After sync characters have
been received, all data will be written on each cycle excluding
lock-down characters.

When a port card is about to be removed, the enable switch on the
port card will be turned off. This will signal the port card to
finish sending valid packets and then send idles. The port card
will send a packet abort K character to indicate that no more valid
packets will be sent immediately following the last valid packet.
It is assumed that when the port card is actually removed, it will
have already sent the packet abort K character. This is critical
for the fabrics to keep their queues in sync. It is important that
each Aggregator on each fabric that handles the particular port
card stops forwarding data to the memory controllers at precisely
the same logical cycle. The WSM will stop writing data into the
byte sync FIFOs once the packet abort character is seen. The Bus
Controller will terminate the packet once the packet abort
character is read out of the byte sync FIFOs.

Case A: No sync/early sync/late sync from port card.

Solution: The Synchronizer will look for a sync at precisely the
same logical cycle each time. This will occur every sync pulse
period, which is approximately 22,000 125 MHz cycles. If the sync
character is not present at the head of the byte sync FIFOs when
22,000 cycles have been read since the last sync character, a sync
error will be flagged and data will be dropped at the cycle where
the sync character should have been. All fabrics need to drop data
at precisely the same logical cycle for this particular input lane.
Inputs for this particular channel will be turned off and the byte
sync FIFOs used for this channel will be flushed. S/W will turn off
the offending Striper. Inputs will be ignored until S/W enables
these inputs again. If a sync character arrives too early, then data
should be dropped at precisely the cycle where the early sync was
read. Other Aggregators will make the same drop decision if this
error is common to all fabrics. If the sync character arrives too
late or not at all, then the drop decision will be made where the
sync character was expected. The sync character is expected to
arrive every 22,000 cycles after the last sync.
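
The sync check described for Case A can be sketched as follows. This
is only an illustrative model, not the hardware implementation: the
class name, the return strings and the reenable() hook are
hypothetical, and it assumes the sync character is expected on
exactly the 22,000th word read after the previous sync.

```python
from dataclasses import dataclass

SYNC_PERIOD = 22_000  # cycles between sync characters, as stated in the text


@dataclass
class SyncChecker:
    """Hypothetical model of the sync check for one input lane."""
    cycles_since_sync: int = 0
    enabled: bool = True

    def read(self, is_sync: bool) -> str:
        """Classify one word read from the head of the byte sync FIFO."""
        if not self.enabled:
            return "ignored"  # inputs stay off until S/W re-enables them
        self.cycles_since_sync += 1
        expected = self.cycles_since_sync == SYNC_PERIOD
        if is_sync and expected:
            self.cycles_since_sync = 0
            return "sync_ok"
        if is_sync or expected:
            # Early sync: error at the cycle the early sync was read.
            # Late or missing sync: error at the cycle it was expected.
            self.enabled = False
            return "sync_error"
        return "data"

    def reenable(self) -> None:
        """Hypothetical S/W hook that turns the inputs back on."""
        self.cycles_since_sync = 0
        self.enabled = True
```

Because every fabric runs the same count from the same sync
character, each one would reach the error decision on the same
logical cycle, which is the property the text relies on.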

Case B: Port card not locking down.

Solution: If the port card does not lock down, it will then send
more than the ideal number of valid and idle cycles between sync
characters. This will be caught by the same logic that checks for
sync characters in the correct logical cycles. Data will be dropped
the same way as in the case where no sync came from the port card.

Case C: Routeword parity errors.

Solution: If a parity error is detected for a particular routeword,
the packet will be terminated at the bad segment and a parity error
will be flagged. Data will be dropped after this terminated segment
is forwarded to the rest of the Aggregator, and FIFOs for this
particular channel will be flushed. Inputs will be disabled until
re-enabled by S/W.

Case D: Garbage data from port card while all fabrics already in
sync.

Solution: If the data is unrecognizable by the gigabit receivers,
errors will be formed and provided to the Aggregator by the gigabit
receivers. At the point of error, data being written into byte sync
FIFOs will be flagged to be in error. If the Bus Controller sees
that the particular byte lane in error is not used for the Routeword
bits, then the error will be flagged but the data will be passed on
to downstream logic. This is considered to be a soft failure since
queues will still be able to stay in sync. If the Bus Controller
sees that the particular byte lane in error is used for the
Routeword bits, then the packet will be terminated and then dropped
once the erred word is read from the byte sync FIFO. The input will
be disabled, a gigabit receiver error will be flagged to S/W, and
byte sync and channel FIFOs associated with this channel will be
flushed. This is considered to be a hard failure. If the failure
occurs only for one fabric, then other fabrics can still be used to
reassemble the packets. S/W will have to queue resync the bad
fabric. If this error occurs across multiple fabrics, not much can
be done to avoid fabric queues from becoming corrupted. S/W will
then have to queue resync all fabrics.
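
The soft/hard distinction in this case can be summarized in a short
sketch. The function names, the dictionary keys and the
representation of byte lanes as integers are assumptions made for
illustration only; the text does not define a software interface.

```python
def classify_receiver_error(lane_in_error: int, routeword_lanes: set) -> dict:
    """Decide how a gigabit receiver error on one byte lane is handled.

    A lane that carries Routeword bits makes the error a hard failure;
    any other lane makes it a soft failure.
    """
    hard = lane_in_error in routeword_lanes
    return {
        "failure": "hard" if hard else "soft",
        "flag_error": True,                    # flagged in both cases
        "pass_data_downstream": not hard,      # soft: queues stay in sync
        "terminate_and_drop_packet": hard,
        "disable_input_and_flush_fifos": hard,
    }


def fabrics_to_resync(failed: set, all_fabrics: set) -> set:
    """One bad fabric is resynced alone; several force a resync of all."""
    if len(failed) <= 1:
        return set(failed)
    return set(all_fabrics)
```

For example, with hypothetical Routeword lanes {0, 1}, an error on
lane 3 would be classified as soft and an error on lane 0 as hard.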

Case E: Port card sending data too fast or too slow. It is possible
that the port card is sending the correct number of valid cycles
between sync characters but is not locking down enough or locking
down too much during each lock down period. Byte sync FIFOs can
eventually overflow or underflow respectively. If more than one
fabric has FIFOs that overflow or underflow and data is dropped at
different logical cycles for the same source, then fabric queues can
become out of sync.

Solution: This is considered a hard failure since it should not
occur if the hardware is working correctly. The only way to possibly
prevent this is to flag an error if the FIFOs reach an almost full
or almost empty threshold. This is a warning sign that something is
wrong. S/W will then turn off the offending port card. Data will
continue to be written to and read from the byte sync FIFOs as if
nothing is wrong. If the port card can be turned off and idles be
sent before byte sync FIFOs overflow, then there will be no dropped
data and fabric queues will stay in sync. If FIFOs overflow or
underflow for a particular channel, then a FIFO overflow/underflow
error will be flagged. The packet being processed by the
synchronizer at the time of error will be terminated. All data will
be dropped from this point on. Inputs for this channel will be
disabled until re-enabled by S/W. FIFOs for this channel will be
flushed.
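
A minimal sketch of the threshold check described above follows. The
FIFO depth and the two thresholds are hypothetical parameters (the
text gives no values), and the action names are illustrative labels
for the steps listed in the solution.

```python
def fifo_status(fill: int, depth: int, almost_empty: int, almost_full: int) -> str:
    """Classify the fill level of one byte sync FIFO."""
    if fill > depth:
        return "overflow"
    if fill < 0:
        return "underflow"
    if fill >= almost_full:
        return "almost_full"
    if fill <= almost_empty:
        return "almost_empty"
    return "ok"


def actions_for(status: str) -> list:
    """Map a FIFO status to the responses described in the text."""
    if status in ("almost_full", "almost_empty"):
        # Warning only: data keeps flowing while S/W shuts the card off.
        return ["flag_threshold_error", "sw_turns_off_port_card"]
    if status in ("overflow", "underflow"):
        return [
            "flag_overflow_underflow_error",
            "terminate_current_packet",
            "drop_all_data",
            "disable_inputs_until_sw_reenables",
            "flush_channel_fifos",
        ]
    return []
```

The thresholds would need to leave enough headroom for S/W to turn
the port card off before a true overflow or underflow occurs.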

Fabric queue resync is performed in order to resynchronize memory
controller queues. It is important that all fabrics are processing
the stripe of the same cell or packet at precisely the same logical
cycle and that all fabrics are acting together as one logical
fabric. Fabric queue resync starts at the Stripers. The Striper will
receive a queue resync cell from the control port. The Striper will
decode the queue resync cell and will back up traffic until the next
sequence counter tick is reached. At this point, it will send a
fabric queue resync K character immediately followed by the queue
resync cell. At the fabric, the WSM in the receive synchronizer will
receive the queue resync K character and notify the Bus Controller
in the receive synchronizer that a queue resync cell is in the input
FIFO and that the queue resync event should occur at the next
transition of the receive sequence counter. The Bus Controller will
then indicate to other Aggregators on the fabric that a resync cell
event will take place at the next transition of the sequence
counter. The indication is asserted about 10 cycles before the
receive sequence counter transitions. This is done to allow enough
time for other Aggregators to see this assertion before their
respective receive sequence counters transition also. Once the
sequence count transition occurs, the Aggregators will signal to the
memory controllers that a queue resync event has occurred and that
this event delimits old and new data. All data sent before the sync
[...] is considered new data. The memory controllers synchronize
their [...] the switch as a regular cell and returned to the control
port.

There can be times when the gigabit network is changing its
operating mode and the switch is changing from a 4 0/00 to an
0 0/120 mode, for example. There is no guarantee that Gigabit
Receivers will be driven by Gigabit Transmitters during this time
period. Aggregators that expect good data from certain Gigabit
Receivers may not get good data. If the switch is increasing its
mode, then a previously unused FIFO will now be used. If this FIFO
has garbage data on its inputs, then syncs will not be received and
this FIFO will not be synced until the gigabit network is stable.
Once the gigabit network is stable, idles and sync characters will
be transmitted by the port cards and the FIFOs will have enough time
to sync up. If the switch is decreasing its mode, then previously
used FIFOs will now be unused. The Aggregator will know the new
switch capacity and will eventually ignore these channel FIFOs.

The Unstriper needs to provide back-pressure to the Separators when
internal FIFOs in the Unstriper become near full. Each Separator
will expect 24 separate back-pressure signals coming from all the
port card channels it is connected to. The back-pressure signal is
considered to be asynchronous to all ASICs. It is required that all
relevant Separators receive back-pressure from a particular channel
in the Unstriper at precisely the same logical cycle. This is done
by having the Unstripers assert the back-pressure signal when their
receive sequence counter transitions. It is assumed that the
Unstriper's receive sequence counter is a delayed version of the
Striper's transmit sequence counter. Since the tick length is 250
cycles and the receive counter is delayed by 150 cycles relative to
the transmit counter, there exist 100 cycles of margin to transport
the back-pressure signal from the Unstriper to the Separator. The
Separator needs about 10 cycles before the transition of its
sequence counter to sample the back-pressure signal. This will give
the Separator enough time to provide back-pressure to the memory
controller before the counter transitions. This places a maximum
requirement on the propagation delay of the back-pressure signal.
The following requirements hold true:

Back-pressure propagation delay < counter tick length - receive
sync pulse delay - setup time of Separator's sample point

Back-pressure propagation delay < 250 - 150 - 10

Back-pressure propagation delay < 90 cycles @ 125 MHz or 720 ns

Assuming worst case conditions, the expected worst case propagation
delay would be:

Back-pressure propagation delay = (Unstriper to Striper delay) +
(Striper to Aggregator delay) + (Aggregator to Separator delay)

Back-pressure propagation delay = 5 cycles (chip and board delay)
+ (5 + 62 cycles) (chip and port card to fabric delay of 500 ns) + 5
cycles (chip and board delay)

Back-pressure propagation delay = 77 cycles < 90 cycles

As can be seen from this estimate, the maximum back-pressure
propagation delay requirement is met.
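
The timing budget above can be checked with a few lines of
arithmetic. The function and parameter names are illustrative; the
numeric values are those given in the text, and the 500 ns port card
to fabric delay is assumed to round down to 62 cycles at 125 MHz.

```python
CLOCK_MHZ = 125
NS_PER_CYCLE = 1000 / CLOCK_MHZ  # 8 ns per cycle at 125 MHz


def budget_cycles(tick_length=250, rx_counter_delay=150, separator_setup=10):
    """Maximum allowed back-pressure propagation delay, in cycles."""
    return tick_length - rx_counter_delay - separator_setup


def worst_case_cycles(chip_board=5, port_to_fabric_ns=500):
    """Estimated worst case Unstriper-to-Separator delay, in cycles."""
    port_to_fabric = int(port_to_fabric_ns // NS_PER_CYCLE)  # 62 cycles
    unstriper_to_striper = chip_board
    striper_to_aggregator = chip_board + port_to_fabric
    aggregator_to_separator = chip_board
    return unstriper_to_striper + striper_to_aggregator + aggregator_to_separator


budget = budget_cycles()         # 90 cycles
budget_ns = budget * NS_PER_CYCLE  # 720 ns
worst = worst_case_cycles()      # 77 cycles
margin = budget - worst          # 13 cycles of slack
```

Under these assumptions the estimate leaves 13 cycles of slack
against the 90 cycle limit.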

Assuming that the relevant Separators receive the back-pressure
signal before the transition to the next sequence count, then it can
be synchronized to the next transition of the transmit sequence
counter. This will allow all relevant Separators to stop sending
valid data at precisely the same logical cycle for one complete
counter tick interval. This is true since it is assumed that when
the transmit sequence counter transitions, data that the Separators
are sending are companion fragments of the same packet. If
back-pressure is sampled again before the next counter transition,
then data will be stopped for another counter tick interval. This
mechanism implies that back-pressure can only be generated on a
counter tick length granularity.

Since there is no direct path from Unstriper to Separator, the
back-pressure signals need to be rerouted from the Unstriper, to the
Striper, to the Aggregator and finally to the Separator. In order to
do this, each Unstriper needs to send the back-pressure signal to
the corresponding Striper on that port card. The Striper will then
forward the back-pressure signal through the backplane gigabit
transceivers onto the Aggregator. The Aggregator will forward up to
24 separate back-pressure signals to one Separator corresponding to
6 buses with 4 channels per bus. The back-pressure signal will
always use bit 9 of the gigabit transceivers. The receive
synchronizer block in the Aggregator will forward the correct
back-pressure signal for the appropriate bus and channel to the
Separator. Since the gigabit receivers are not dedicated to any
particular bus and channel, the synchronizer needs to select the
correct gigabit receiver based on the switch configuration just like
it does for regular data. Once this is done, bit 9 of the gigabit
receiver is forwarded on as the back-pressure signal. Note that bit
9 is also used for receiving K characters and can change when
sending a K character. In order to avoid mistakenly interpreting bit
9 of a K character as a valid back-pressure signal, the synchronizer
will only sample the back-pressure bit when valid data is received
from the gigabit receiver. In the case where a K character is
received, the synchronizer will hold the back-pressure signal at its
current value. There is still a case where the Striper can be
sending back-to-back idle characters since there is nothing to send.
If the Striper needs to change the value of the back-pressure signal
in this case, then it will send one of two K characters that change
the back-pressure value. The two K characters that will be used are
a set and clear of the back-pressure signal. If the synchronizer
receives a back-pressure set or clear character, it will set or
clear the back-pressure signal respectively. If any other K
character is received, the current back-pressure signal is retained.
If valid data is received, bit 9 of the appropriate gigabit receiver
is sampled as the back-pressure signal.
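
The sampling rule for the back-pressure bit can be sketched as a
small state update. The enum members and the function name are
hypothetical labels for the word types named in the text; only the
update rule itself comes from the description above.

```python
from enum import Enum, auto


class Word(Enum):
    """Kinds of words the synchronizer can receive (illustrative names)."""
    DATA = auto()        # valid data: bit 9 carries back-pressure
    K_BP_SET = auto()    # K character that sets back-pressure
    K_BP_CLEAR = auto()  # K character that clears back-pressure
    K_OTHER = auto()     # any other K character, including idles


def next_back_pressure(current: bool, kind: Word, bit9: bool) -> bool:
    """Return the back-pressure value after receiving one word."""
    if kind is Word.DATA:
        return bit9      # sample bit 9 only on valid data
    if kind is Word.K_BP_SET:
        return True
    if kind is Word.K_BP_CLEAR:
        return False
    return current       # other K characters: hold the current value
```

Holding the value on other K characters is what keeps bit 9 of an
idle or sync character from being misread as back-pressure.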

Although the invention has been described in detail in
the foregoing embodiments for the purpose of illustration, it is to
be understood that such detail is solely for that purpose and that
variations can be made therein by those skilled in the art without
departing from the spirit and scope of the invention except as it
may be described by the following claims.



WHAT IS CLAIMED IS: 



1. A switch for switching packets from a plurality of
sources comprising:

a memory in which portions of packets are stored; and 

a transferring mechanism which transfers predetermined 
portions of a packet to the memory as the predetermined portions 
are received. 

2. A switch as described in Claim 1 wherein the 
transferring mechanism transfers predetermined portions of the 
packet as fixed length segments as the fixed length segments are 
received followed by a single final segment of any length wherein 
the packet is transferred to the memory. 

3. A switch as described in Claim 2 wherein the 
transferring mechanism transfers fixed length segments of different 
packets interleaved among each other as they are received to the 
memory. 

4. A switch as described in Claim 3 wherein the
transferring mechanism includes an aggregator which receives 
portions of packets from the plurality of sources. 

5. A switch as described in Claim 4 wherein the memory 
includes a memory controller. 



6. A switch as described in Claim 5 wherein the 
aggregator uses TDM to multiplex segments of packets from different 
sources to the memory controller. 

7. A switch as described in Claim 6 wherein the 
aggregator places an identifier with each segment identifying from 
which source the segments came from. 

8. A switch as described in Claim 7 wherein the memory 
controller includes per source queues, and stores each segment in 
a corresponding per source queue based on the identifier of the 
segment.

9. A switch as described in Claim 8 wherein the memory 
controller includes per destination queues, and once all segments 
for a packet are received at a per source queue, all the segments 
of the packet are changed to a corresponding per destination queue. 

10. A switch as described in Claim 9 wherein the memory 
controller has acceptance criteria for accepting segments, and if 
the segment is not accepted, then all previously received segments 
associated with the segment not accepted are purged from the per 
source queue and any segments associated with the segment not 
accepted that are received after the segment that was not accepted 
was received, are ignored. 

11. A switch as described in Claim 10 including a fabric 
in which the aggregator and the memory controller are disposed, and 
including a separator disposed in the fabric connected to the 
aggregator.



12. A switch as described in Claim 11 including a port 
card having a striper which sends portions of packets to the 
aggregator, and an unstriper which receives portions of packets 
from the separator. 

13. A switch as described in Claim 12 wherein the memory 
controller includes a shared memory, and the destination queues and 
the source queues are part of the shared memory. 

14. A method for switching packets comprising the steps of:

receiving portions of a packet at a transferring 
mechanism of a switch; and 

transferring predetermined portions of the packet to a 
memory of the switch as the predetermined portions are received at 
the transferring mechanism. 

15. A method as described in Claim 14 wherein the 
transferring step includes the step of transferring the 
predetermined portions as fixed length segments as the fixed length 
segments are received at the transferring mechanism followed by a 
single final segment of any length wherein the packet is 
transferred to the memory. 

16. A method as described in Claim 15 wherein the 
transferring step includes the step of transferring fixed length 
segments of different packets as they are received interleaved 
among each other to the memory.

17. A method as described in Claim 16 wherein the
receiving step includes the step of receiving portions of packets 
from different sources at an aggregator of the transferring 
mechanism disposed in a fabric of the switch. 

18. A method as described in Claim 17 wherein the 
transferring step includes the step of multiplexing with the 
aggregator segments of packets from different sources to the memory 
controller.

19. A method as described in Claim 18 wherein before the 
transferring step, there is the step of placing by the aggregator 
an identifier with each segment identifying from which source the 
segment came from. 

20. A method as described in Claim 19 wherein after the 
transferring step, there is the step of storing each segment in a 
corresponding per source queue of the memory controller based on 
the identifier of the segment. 

21. A method as described in Claim 20 including after 
the storing step, there is the step of changing all segments of the 
packet in the source queue to a corresponding per destination queue 
of the memory controller once all the segments of the packet are 
received at the per source queue. 

22. A method as described in Claim 21 wherein the 
receiving step includes the steps of purging all previously 
received segments associated with an unaccepted segment that does 
not meet acceptance criteria for accepting a segment of the memory 



controller, and ignoring all segments associated with the 
unaccepted segment received at the memory controller after the 
unaccepted segment is received at the memory controller. 

23. A method as described in Claim 22 wherein the 
receiving step includes the step of receiving portions of packets 
from different sources at the aggregator of the transferring 
mechanism disposed in the fabric of the switch from a striper of a 
port card of the switch. 

24. A method as described in Claim 23 including after 
the moving step, there is the step of sending portions of packets 
from the memory controller with a separator of the fabric to an 
unstriper of the port card.

ABSTRACT OF THE DISCLOSURE

LONG PACKET HANDLING 

A switch for switching packets from a plurality of 
sources. The switch includes a memory in which portions of packets 
are stored. The switch includes a transferring mechanism which 
transfers predetermined portions of a packet to the memory as the 
predetermined portions are received. A method for switching 
packets. The method includes the steps of receiving portions of a 
packet at a transferring mechanism of a switch. Then there is the 
step of transferring predetermined portions of the packet to a 
memory of the switch as the predetermined portions are received at 
the transferring mechanism. 



